Checking for Duplicates

Uploaded by mrlogan123, 04-Jun-2018. Category: Documents

  • 8/13/2019 Checking for Duplicates


    Checking for Duplicates

    On any version of SQL Server, you can identify duplicates using a simple query, with GROUP BY and HAVING, as follows:

    DECLARE @table TABLE (data VARCHAR(20))

    INSERT INTO @table VALUES ('not duplicate row')
    INSERT INTO @table VALUES ('duplicate row')
    INSERT INTO @table VALUES ('duplicate row')

    SELECT data
         , COUNT(data) nr
    FROM @table
    GROUP BY data
    HAVING COUNT(data) > 1
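The GROUP BY/HAVING check above is standard SQL and works the same way in any engine. As a minimal runnable sketch (not part of the original article), here is the identical pattern against an in-memory SQLite database via Python's built-in sqlite3 module; the table name `t` is a stand-in:

```python
import sqlite3

# In-memory database standing in for the SQL Server table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (data VARCHAR(20))")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("not duplicate row",), ("duplicate row",), ("duplicate row",)])

# Same GROUP BY / HAVING pattern: report each duplicated value and its count
dupes = conn.execute("""
    SELECT data, COUNT(data) AS nr
    FROM t
    GROUP BY data
    HAVING COUNT(data) > 1
""").fetchall()

print(dupes)  # [('duplicate row', 2)]
```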

    Removing Duplicate Rows in SQL Server

    The following sections present a variety of techniques for removing duplicates from SQL Server database tables, depending on the nature of the table design.

    Tables with no primary key

    When you have duplicates in a table that has no primary key defined, and you are using an older version of SQL Server, such as SQL Server 2000, you do not have an easy way to identify a single row. Therefore, you cannot simply delete this row by specifying a WHERE clause in a DELETE statement.

    You can, however, use the SET ROWCOUNT 1 command, which will restrict the subsequent DELETE statement to removing only one row. For example:

    DECLARE @table TABLE (data VARCHAR(20))

    INSERT INTO @table VALUES ('not duplicate row')
    INSERT INTO @table VALUES ('duplicate row')
    INSERT INTO @table VALUES ('duplicate row')

    SET ROWCOUNT 1
    DELETE FROM @table WHERE data = 'duplicate row'
    SET ROWCOUNT 0


    In the above example, only one row is deleted. Consequently, there will be one remaining row with the content 'duplicate row'. If you have more than one duplicate of a particular row, you would simply adjust the ROWCOUNT accordingly. Note that after the delete, you should reset the ROWCOUNT to 0 so that subsequent queries are not affected.
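SET ROWCOUNT is specific to SQL Server. As an aside (an assumption of this sketch, not the article's method), the same "delete exactly one of several indistinguishable copies" effect can be had in other engines by addressing a single physical row; SQLite exposes a built-in rowid for that purpose:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (data VARCHAR(20))")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("not duplicate row",), ("duplicate row",), ("duplicate row",)])

# Delete just one of the two identical rows by picking a single rowid,
# mimicking SET ROWCOUNT 1 + DELETE in the T-SQL example
conn.execute("""
    DELETE FROM t
    WHERE rowid = (SELECT rowid FROM t WHERE data = 'duplicate row' LIMIT 1)
""")

remaining = conn.execute(
    "SELECT COUNT(*) FROM t WHERE data = 'duplicate row'").fetchone()[0]
print(remaining)  # 1
```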

    To remove all duplicates in a single pass, the following code will work, but is likely to be horrendously slow if there are a large number of duplicates and table rows:

    DECLARE @table TABLE (data VARCHAR(20))

    INSERT INTO @table VALUES ('not duplicate row')
    INSERT INTO @table VALUES ('duplicate row')
    INSERT INTO @table VALUES ('duplicate row')

    SET NOCOUNT ON
    SET ROWCOUNT 1

    WHILE 1 = 1
    BEGIN
        DELETE FROM @table
        WHERE data IN (SELECT data
                       FROM @table
                       GROUP BY data
                       HAVING COUNT(*) > 1)
        IF @@ROWCOUNT = 0
            BREAK;
    END

    SET ROWCOUNT 0

    When cleaning up a table that has a large number of duplicate rows, a better approach is to select just a distinct list of the duplicates, delete all occurrences of those duplicate entries from the original, and then insert the list into the original table.

    DECLARE @table TABLE (data VARCHAR(20))

    INSERT INTO @table VALUES ('not duplicate row')
    INSERT INTO @table VALUES ('duplicate row')


    INSERT INTO @table VALUES ('duplicate row')
    INSERT INTO @table VALUES ('second duplicate row')
    INSERT INTO @table VALUES ('second duplicate row')

    SELECT data
    INTO #duplicates
    FROM @table
    GROUP BY data
    HAVING COUNT(*) > 1

    -- delete all rows that are duplicated
    DELETE FROM @table
    FROM @table o INNER JOIN #duplicates d
        ON d.data = o.data

    -- insert one row for every duplicate set
    INSERT INTO @table (data)
    SELECT data
    FROM #duplicates

    As a variation of this technique, you could select all the data, without duplicates, into a new table, delete the old table, and then rename the new table to match the name of the original table:

    CREATE TABLE duplicateTable3 (data VARCHAR(20))

    INSERT INTO duplicateTable3 VALUES ('not duplicate row')
    INSERT INTO duplicateTable3 VALUES ('duplicate row')
    INSERT INTO duplicateTable3 VALUES ('duplicate row')
    INSERT INTO duplicateTable3 VALUES ('second duplicate row')
    INSERT INTO duplicateTable3 VALUES ('second duplicate row')

    SELECT DISTINCT data
    INTO tempTable


    FROM duplicateTable3
    GO

    TRUNCATE TABLE duplicateTable3
    DROP TABLE duplicateTable3
    EXEC sp_rename 'tempTable', 'duplicateTable3'

    In this solution, the SELECT DISTINCT will select all the rows from our table except for the duplicates. These rows are immediately inserted into a table named tempTable. This is a temporary table in the sense that we will use it to temporarily store the unique rows. However, it is not a true temporary table (i.e. one that lives in the temporary database), because we need the table to exist in the current database, so that it can later be renamed, using sp_rename.

    The sp_rename command is an absolutely horrible way of renaming textual objects, such as stored procedures, because it does not update all the system tables consistently. However, it works well for non-textual schema objects, such as tables.

    Note that this solution is usually used on a table that has no primary key. If there is a key, and there are foreign keys referencing the rows that are identified as being duplicates, then the foreign key constraints need to be dropped and re-created during the table swap.

    Tables with a primary key, but no foreign key constraints

    If your table has a primary key, but no foreign key constraints, then the following solution offers a way to remove duplicates that is much quicker, as it entails less iteration:

    DECLARE @table TABLE (
        id INT IDENTITY(1,1)
       ,data VARCHAR(20)
    )

    INSERT INTO @table VALUES ('not duplicate row')
    INSERT INTO @table VALUES ('duplicate row')
    INSERT INTO @table VALUES ('duplicate row')

    WHILE 1 = 1
    BEGIN
        DELETE FROM @table
        WHERE id IN (SELECT MAX(id)
                     FROM @table
                     GROUP BY data
                     HAVING COUNT(*) > 1)
        IF @@ROWCOUNT = 0
            BREAK;
    END


    DELETE FROM @table
    FROM @table o INNER JOIN (SELECT data
                              FROM @table
                              GROUP BY data
                              HAVING COUNT(*) > 1
                             ) f ON o.data = f.data
    LEFT OUTER JOIN (SELECT [id] = MAX(id)
                     FROM @table
                     GROUP BY data
                     HAVING COUNT(*) > 1
                    ) g ON o.id = g.id
    WHERE g.id IS NULL

    This can be simplified even further, though the logic is rather harder to follow.

    DELETE FROM f
    FROM @table AS f INNER JOIN @table AS g
        ON g.data = f.data
       AND f.id < g.id


    INSERT INTO duplicateTable4 VALUES ('not duplicate row')
    INSERT INTO duplicateTable4 VALUES ('duplicate row')
    INSERT INTO duplicateTable4 VALUES ('duplicate row')
    INSERT INTO duplicateTable4 VALUES ('second duplicate row')
    INSERT INTO duplicateTable4 VALUES ('second duplicate row')

    SELECT IDENTITY(INT, 1, 1) AS id,
           data
    INTO duplicateTable4_Copy
    FROM duplicateTable4

    The above will create the duplicateTable4_Copy table. This table will have an identity column named id, which will already have unique numeric values set. Note that although we are creating an identity column, uniqueness is not enforced in this case; you will need to add a unique index or define the id column as a primary key.

    Using a cursor

    People with an application development background would consider using a cursor to try to eliminate duplicates. The basic idea is to order the contents of the table, iterate through the ordered rows, and check if the current row is equal to the previous row. If it is, delete the row. This solution could look like the following in T-SQL:

    CREATE TABLE duplicateTable5 (data VARCHAR(30))

    INSERT INTO duplicateTable5 VALUES ('not duplicate row')
    INSERT INTO duplicateTable5 VALUES ('duplicate row')
    INSERT INTO duplicateTable5 VALUES ('duplicate row')
    INSERT INTO duplicateTable5 VALUES ('second duplicate row')
    INSERT INTO duplicateTable5 VALUES ('second duplicate row')

    DECLARE @data VARCHAR(30),
            @previousData VARCHAR(30)

    DECLARE cursor1 CURSOR SCROLL_LOCKS
    FOR SELECT data
        FROM duplicateTable5


        ORDER BY data
    FOR UPDATE

    OPEN cursor1
    FETCH NEXT FROM cursor1 INTO @data
    WHILE @@FETCH_STATUS = 0
    BEGIN
        IF @previousData = @data
            DELETE FROM duplicateTable5
            WHERE CURRENT OF cursor1
        SET @previousData = @data
        FETCH NEXT FROM cursor1 INTO @data
    END
    CLOSE cursor1
    DEALLOCATE cursor1

    The above script will not work, because once you apply the ORDER BY clause in the cursor declaration the cursor will become read-only. If you remove the ORDER BY clause, then there will be no guarantee that the rows will be in order, and checking two subsequent rows would no longer be sufficient to identify duplicates. Interestingly, since the above example creates a small table where all the rows fit onto a single database page and duplicate rows are inserted in groups, removing the ORDER BY clause does make the cursor solution work. It will fail, however, with any table that is larger and has seen some modifications.

    Problem

    In data warehousing applications, during ETL (Extraction, Transformation and Loading), or even in OLTP (On Line Transaction Processing) applications, we often encounter duplicate records in our tables. To make the table data consistent and accurate we need to get rid of these duplicate records, keeping only one of them in the table. In this tip I discuss different strategies which you can take for this, along with the pros and cons.

    Solution

    There are different methods for deleting duplicate (de-duplication) records from a table, each of which has its own pros and cons. I am going to discuss these methods, the prerequisites of each, along with their pros and cons.

    1. Using correlated subquery
    2. Using temporary table


    SELECT * FROM Employee
    GO

    --Selecting distinct records
    SELECT *
    FROM Employee E1
    WHERE E1.ID = (SELECT MAX(ID)
                   FROM Employee E2
                   WHERE E2.FirstName = E1.FirstName
                     AND E1.LastName = E2.LastName
                     AND E1.Address = E2.Address)
    GO

    --Deleting duplicates
    DELETE Employee
    WHERE ID < (SELECT MAX(ID)
                FROM Employee E2
                WHERE E2.FirstName = Employee.FirstName
                  AND E2.LastName = Employee.LastName
                  AND E2.Address = Employee.Address)
    GO
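The keep-the-highest-ID logic of the correlated subquery translates directly to other engines. As a minimal runnable sketch (the sample rows are invented for illustration), the same delete against an in-memory SQLite database via Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employee (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    FirstName TEXT, LastName TEXT, Address TEXT)""")
# Two exact-duplicate groups: 2 copies of one person, 3 of another
rows = [("John", "Doe", "1 Main St")] * 2 + [("Jane", "Roe", "2 Oak Ave")] * 3
conn.executemany(
    "INSERT INTO Employee (FirstName, LastName, Address) VALUES (?,?,?)", rows)

# Correlated subquery: delete every row whose ID is lower than the
# highest ID among its exact duplicates, keeping one copy per group
conn.execute("""
    DELETE FROM Employee
    WHERE ID < (SELECT MAX(E2.ID) FROM Employee E2
                WHERE E2.FirstName = Employee.FirstName
                  AND E2.LastName  = Employee.LastName
                  AND E2.Address   = Employee.Address)
""")

count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
print(count)  # 2
```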


    BEGIN TRAN
    -- Pull distinct records into a new table
    SELECT DISTINCT * INTO EmployeeNew FROM Employee
    -- Drop the old target table
    DROP TABLE Employee
    -- Rename the new table
    EXEC sp_rename 'EmployeeNew', 'Employee'
    COMMIT TRAN
    GO
    SELECT * FROM Employee
    GO
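The swap-and-rename technique is portable. As a runnable sketch of the same three steps (sample rows invented for illustration; SQLite uses ALTER TABLE ... RENAME TO where SQL Server uses sp_rename):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (FirstName TEXT, LastName TEXT, Address TEXT)")
rows = [("John", "Doe", "1 Main St")] * 2 + [("Jane", "Roe", "2 Oak Ave")] * 3
conn.executemany("INSERT INTO Employee VALUES (?,?,?)", rows)

# 1) pull distinct records into a new table
conn.execute("CREATE TABLE EmployeeNew AS SELECT DISTINCT * FROM Employee")
# 2) drop the old target table
conn.execute("DROP TABLE Employee")
# 3) rename the new table over the old name
conn.execute("ALTER TABLE EmployeeNew RENAME TO Employee")

count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
print(count)  # 2
```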

    4. Using Common Table Expression (CTE)

    SQL Server 2005 introduced the Common Table Expression (CTE), which acts as a temporary result set that is defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement.

    In this example I am using a CTE for de-duplication. I am using the ROW_NUMBER function to return the sequential number of each row within a partition of a result set, which is a grouping based on the [FirstName], [LastName], [Address] columns (or whichever columns of the table define a duplicate), and then I am deleting all records except where the sequential number is 1. This means keeping one record from each group and deleting all other similar/duplicate records. This is one of the most efficient methods to delete records and I would suggest using it if you have SQL Server 2005 or 2008.

    Script #5 - Using CTE for de-duplication

    --example1
    WITH CTE AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY [FirstName], [LastName], [Address]
                                  ORDER BY [FirstName] DESC, [LastName] DESC, [Address] DESC) AS RowNumber
        FROM Employee tbl
        WHERE EXISTS (SELECT TOP 1 1
                      FROM (SELECT FirstName, LastName, Address
                            FROM Employee
                            GROUP BY [FirstName], [LastName], [Address]
                            HAVING COUNT(*) > 1) GrpTable
                      WHERE GrpTable.FirstName = tbl.FirstName
                        AND GrpTable.LastName = tbl.LastName
                        AND GrpTable.Address = tbl.Address)
    )
    DELETE FROM CTE WHERE RowNumber > 1
    GO
    SELECT * FROM Employee
    GO

    --A more simplified and faster example
    WITH CTE AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY [FirstName], [LastName], [Address]
                                  ORDER BY [FirstName] DESC, [LastName] DESC, [Address] DESC) AS RowNumber,
               [FirstName], [LastName], [Address]
        FROM Employee tbl
    )
    DELETE FROM CTE WHERE RowNumber > 1
    GO
    SELECT * FROM Employee
    GO
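Deleting through a CTE is a T-SQL feature. As a hedged stand-in showing the same "keep one row per group" outcome in an engine without updatable CTEs, this sqlite3 sketch keeps the minimum rowid per group instead of row number 1 (sample rows invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (FirstName TEXT, LastName TEXT, Address TEXT)")
rows = [("John", "Doe", "1 Main St")] * 2 + [("Jane", "Roe", "2 Oak Ave")] * 3
conn.executemany("INSERT INTO Employee VALUES (?,?,?)", rows)

# Keep exactly one row per (FirstName, LastName, Address) group:
# everything except the first physical row of each group is deleted
conn.execute("""
    DELETE FROM Employee
    WHERE rowid NOT IN (SELECT MIN(rowid)
                        FROM Employee
                        GROUP BY FirstName, LastName, Address)
""")

remaining = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
print(remaining)  # 2
```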

    5. Using Fuzzy Group Transformation in SSIS

    If you are using SSIS to upload data to your target table, you can use a Fuzzy Grouping Transformation before inserting records into the destination table to ignore duplicate records and insert only unique records. Here, in the image below, you can see 9 records are coming from the source, but only 3 records are being inserted into the target table; that's because only 3 records out of the 9 are unique. Refer to Script #2 above to see more about these 9 records that were used.

    In the Fuzzy Grouping Transformation editor, on the Columns tab you specify the columns which you want to be included in the grouping. As you can see in the image below, I have chosen all 3 columns for grouping.

    http://msdn.microsoft.com/en-us/library/ms141764.aspx

    After the Fuzzy Grouping Transformation, you might add a conditional split to direct unique rows and duplicate rows to two different destinations. Here in the example you can see I am routing all the unique rows to the destination table and ignoring the duplicate records. The Fuzzy Grouping Transformation produces a few additional columns, like _key_in, which uniquely identifies each row, and _key_out, which identifies a group of duplicate records.
