TRANSCRIPT
8/13/2019 Checking for Duplicates
Checking for Duplicates
On any version of SQL Server, you can identify duplicates using a simple query with GROUP BY and HAVING, as follows:
DECLARE @table TABLE (data VARCHAR(20))
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

SELECT data
      ,COUNT(data) nr
FROM @table
GROUP BY data
HAVING COUNT(data) > 1
Removing Duplicate Rows in SQL Server
The following sections present a variety of techniques for removing duplicates from SQL Server database tables, depending on the nature of the table design.
Tables with no primary key
When you have duplicates in a table that has no primary key defined, and you are using an older version of SQL Server, such as SQL Server 2000, you do not have an easy way to identify a single row. Therefore, you cannot simply delete this row by specifying a WHERE clause in a DELETE statement.
You can, however, use the SET ROWCOUNT 1 command, which will restrict the subsequent DELETE statement to removing only one row. For example:
DECLARE @table TABLE (data VARCHAR(20))
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

SET ROWCOUNT 1
DELETE FROM @table WHERE data = 'duplicate row'
SET ROWCOUNT 0
In the above example, only one row is deleted. Consequently, there will be one remaining row with the content 'duplicate row'. If you have more than one duplicate of a particular row, you would simply adjust the ROWCOUNT accordingly. Note that after the delete, you should reset the ROWCOUNT to 0 so that subsequent queries are not affected.
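For instance, if 'duplicate row' appeared three times, you would restrict the DELETE to two rows so that a single copy survives. A minimal sketch of that adjustment, reusing the pattern of the example above:

```sql
DECLARE @table TABLE (data VARCHAR(20))
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

-- three copies exist, so two must be removed: limit the DELETE to 2 rows
SET ROWCOUNT 2
DELETE FROM @table WHERE data = 'duplicate row'
-- reset so later statements are unaffected
SET ROWCOUNT 0

SELECT data FROM @table  -- one 'duplicate row' remains
```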
To remove all duplicates in a single pass, the following code will work, but is likely to be horrendously slow if there are a large number of duplicates and table rows:
DECLARE @table TABLE (data VARCHAR(20))
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

SET NOCOUNT ON
SET ROWCOUNT 1
WHILE 1 = 1
BEGIN
    DELETE FROM @table
    WHERE data IN (SELECT data
                   FROM @table
                   GROUP BY data
                   HAVING COUNT(*) > 1)
    IF @@Rowcount = 0
        BREAK;
END
SET ROWCOUNT 0
When cleaning up a table that has a large number of duplicate rows, a better approach is to select just a distinct list of the duplicates, delete all occurrences of those duplicate entries from the original, and then insert the list into the original table.
DECLARE @table TABLE (data VARCHAR(20))
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('second duplicate row')
INSERT INTO @table VALUES ('second duplicate row')

SELECT data
INTO #duplicates
FROM @table
GROUP BY data
HAVING COUNT(*) > 1

-- delete all rows that are duplicated
DELETE FROM @table
FROM @table o INNER JOIN #duplicates d
ON d.data = o.data

-- insert one row for every duplicate set
INSERT INTO @table (data)
SELECT data
FROM #duplicates
As a variation of this technique, you could select all the data, without duplicates, into a new table, delete the old table, and then rename the new table to match the name of the original table:
CREATE TABLE duplicateTable3 (data VARCHAR(20))
INSERT INTO duplicateTable3 VALUES ('not duplicate row')
INSERT INTO duplicateTable3 VALUES ('duplicate row')
INSERT INTO duplicateTable3 VALUES ('duplicate row')
INSERT INTO duplicateTable3 VALUES ('second duplicate row')
INSERT INTO duplicateTable3 VALUES ('second duplicate row')

SELECT DISTINCT data
INTO tempTable
FROM duplicateTable3
GO

TRUNCATE TABLE duplicateTable3
DROP TABLE duplicateTable3
EXEC sp_rename 'tempTable', 'duplicateTable3'
In this solution, the SELECT DISTINCT will select all the rows from our table except for the duplicates. These rows are immediately inserted into a table named tempTable. This is a temporary table in the sense that we will use it to temporarily store the unique rows. However, it is not a true temporary table (i.e. one that lives in the temporary database), because we need the table to exist in the current database, so that it can later be renamed, using sp_rename.
The sp_rename command is an absolutely horrible way of renaming textual objects, such as stored procedures, because it does not update all the system tables consistently. However, it works well for non-textual schema objects, such as tables.
Note that this solution is usually used on a table that has no primary key. If there is a key, and there are foreign keys referencing the rows that are identified as being duplicates, then the foreign key constraints need to be dropped and re-created during the table swap.
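As a sketch of that constraint handling, assume a hypothetical child table orders with a foreign key named FK_orders_duplicateTable3 referencing duplicateTable3 (both names are illustrative, not from the original example). The constraint is dropped before the swap and re-created afterwards:

```sql
-- hypothetical names: orders, FK_orders_duplicateTable3, and the id column
-- are illustrations, not part of the original example
ALTER TABLE orders DROP CONSTRAINT FK_orders_duplicateTable3

-- ... perform the SELECT DISTINCT, drop, and sp_rename swap here ...

ALTER TABLE orders ADD CONSTRAINT FK_orders_duplicateTable3
    FOREIGN KEY (duplicateTable3_id) REFERENCES duplicateTable3 (id)
```

Any rows in the child table that referenced a deleted duplicate must be re-pointed at the surviving row before the constraint can be re-created, otherwise the ADD CONSTRAINT will fail validation.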
Tables with a primary key, but no foreign key constraints
If your table has a primary key, but no foreign key constraints, then the following solution offers a way to remove duplicates that is much quicker, as it entails less iteration:
DECLARE @table TABLE (
    id INT IDENTITY(1,1)
   ,data VARCHAR(20)
)
INSERT INTO @table VALUES ('not duplicate row')
INSERT INTO @table VALUES ('duplicate row')
INSERT INTO @table VALUES ('duplicate row')

WHILE 1 = 1
BEGIN
    DELETE FROM @table
    WHERE id IN (SELECT MAX(id)
                 FROM @table
                 GROUP BY data
                 HAVING COUNT(*) > 1)
    IF @@Rowcount = 0
        BREAK;
END

Alternatively, the duplicates can be removed in a single, set-based statement, by deleting every duplicated row whose id is not the highest one in its group:

DELETE FROM @table
FROM @table o INNER JOIN (SELECT data
                          FROM @table
                          GROUP BY data
                          HAVING COUNT(*) > 1
                         ) f ON o.data = f.data
     LEFT OUTER JOIN (SELECT [id] = MAX(id)
                      FROM @table
                      GROUP BY data
                      HAVING COUNT(*) > 1
                     ) g ON o.id = g.id
WHERE g.id IS NULL
This can be simplified even further, though the logic is rather harder to follow:

DELETE FROM f
FROM @table AS f INNER JOIN @table AS g
ON g.data = f.data
AND f.id < g.id
If the table has no key column at all, one can be added by copying the data into a new table that includes an IDENTITY column, as follows. First, create and populate a table with duplicates:

CREATE TABLE duplicateTable4 (data VARCHAR(20))
INSERT INTO duplicateTable4 VALUES ('not duplicate row')
INSERT INTO duplicateTable4 VALUES ('duplicate row')
INSERT INTO duplicateTable4 VALUES ('duplicate row')
INSERT INTO duplicateTable4 VALUES ('second duplicate row')
INSERT INTO duplicateTable4 VALUES ('second duplicate row')

SELECT IDENTITY(INT, 1, 1) AS id,
       data
INTO duplicateTable4_Copy
FROM duplicateTable4
The above will create the duplicateTable4_Copy table. This table will have an identity column named id, which will already have unique numeric values set. Note that although we are creating an identity column, uniqueness is not enforced in this case; you will need to add a unique index or define the id column as a primary key.
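Either of the following would enforce that uniqueness (the constraint and index names here are illustrative, not from the original text):

```sql
-- option 1: promote id to a primary key
-- (an IDENTITY column created via SELECT ... INTO is already NOT NULL,
--  so this ALTER succeeds as long as the ids really are unique)
ALTER TABLE duplicateTable4_Copy
    ADD CONSTRAINT PK_duplicateTable4_Copy PRIMARY KEY (id)

-- option 2: add a unique index instead
CREATE UNIQUE INDEX IX_duplicateTable4_Copy_id
    ON duplicateTable4_Copy (id)
```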
Using a cursor
People with an application development background might consider using a cursor to try to eliminate duplicates. The basic idea is to order the contents of the table, iterate through the ordered rows, and check if the current row is equal to the previous row. If it is, delete the row. This solution could look like the following in T-SQL:
CREATE TABLE duplicateTable5 (data VARCHAR(30))
INSERT INTO duplicateTable5 VALUES ('not duplicate row')
INSERT INTO duplicateTable5 VALUES ('duplicate row')
INSERT INTO duplicateTable5 VALUES ('duplicate row')
INSERT INTO duplicateTable5 VALUES ('second duplicate row')
INSERT INTO duplicateTable5 VALUES ('second duplicate row')

DECLARE @data VARCHAR(30),
        @previousData VARCHAR(30)

DECLARE cursor1 CURSOR SCROLL_LOCKS
FOR SELECT data
    FROM duplicateTable5
    ORDER BY data
FOR UPDATE

OPEN cursor1
FETCH NEXT FROM cursor1 INTO @data
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @previousData = @data
        DELETE FROM duplicateTable5
        WHERE CURRENT OF cursor1
    SET @previousData = @data
    FETCH NEXT FROM cursor1 INTO @data
END
CLOSE cursor1
DEALLOCATE cursor1
The above script will not work, because once you apply the ORDER BY clause in the cursor declaration, the cursor will become read-only. If you remove the ORDER BY clause, then there will be no guarantee that the rows will be in order, and checking two subsequent rows would no longer be sufficient to identify duplicates. Interestingly, since the above example creates a small table where all the rows fit onto a single database page and duplicate rows are inserted in groups, removing the ORDER BY clause does make the cursor solution work. It will fail, however, with any table that is larger and has seen some modifications.
Problem
In data warehousing applications, during ETL (Extraction, Transformation and Loading), or even in OLTP (On Line Transaction Processing) applications, we often encounter duplicate records in our table. To make the table data consistent and accurate, we need to get rid of these duplicate records, keeping only one of them in the table. In this tip I discuss different strategies which you can take for this, along with the pros and cons.
Solution
There are different methods for deleting duplicate (de-duplication) records from a table, each of which has its own pros and cons. I am going to discuss these methods and the prerequisites of each, along with the pros and cons.
1. Using correlated subquery
2. Using temporary table
3. Creating a new table with distinct records and renaming it
4. Using Common Table Expression (CTE)
5. Using Fuzzy Group Transformation in SSIS
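The scripts that follow all work against an Employee table; the original setup script falls in a part of the transcript that is missing, so the following is an assumed minimal sketch of its shape. The column names ID, FirstName, LastName, and Address come from the later scripts; the column sizes and sample rows are invented for illustration:

```sql
-- assumed table shape; only the column names appear in the surviving scripts
CREATE TABLE Employee
(
    ID INT IDENTITY(1,1),
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Address VARCHAR(100)
)

-- invented sample rows, including duplicates to de-duplicate
INSERT INTO Employee VALUES ('John', 'Smith', '123 Main St')
INSERT INTO Employee VALUES ('John', 'Smith', '123 Main St')
INSERT INTO Employee VALUES ('Jane', 'Doe', '456 Oak Ave')
```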
SELECT * FROM Employee
GO

--Selecting distinct records
SELECT * FROM Employee E1
WHERE E1.ID = (SELECT MAX(ID)
               FROM Employee E2
               WHERE E2.FirstName = E1.FirstName
               AND E1.LastName = E2.LastName
               AND E1.Address = E2.Address)
GO

--Deleting duplicates
DELETE Employee
WHERE ID < (SELECT MAX(ID)
            FROM Employee E2
            WHERE E2.FirstName = Employee.FirstName
            AND E2.LastName = Employee.LastName
            AND E2.Address = Employee.Address)
GO
3. Creating a new table with distinct records and renaming it

BEGIN TRAN
-- Pull distinct records in a new table
SELECT DISTINCT *
INTO EmployeeNew
FROM Employee

--Drop the old target table
DROP TABLE Employee

--rename the new table
EXEC sp_rename 'EmployeeNew', 'Employee'
COMMIT TRAN
GO

SELECT * FROM Employee
GO
4. Using Common Table Expression (CTE)
SQL Server 2005 introduced the Common Table Expression (CTE), which acts as a temporary result set that is defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement.
In this example I am using a CTE for de-duplication. I am using the ROW_NUMBER function to return the sequential number of each row within a partition of a result set, which is a grouping based on the [FirstName], [LastName], [Address] columns (or columns of the table), and then I am deleting all records except where the sequential number is 1. This means keeping one record from the group and deleting all other similar/duplicate records. This is one of the efficient methods to delete records and I would suggest using this if you have SQL Server 2005 or 2008.
Script #5 - Using CTE for de-duplication
--example1
WITH CTE AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY [FirstName], [LastName], [Address]
                              ORDER BY [FirstName] DESC, [LastName] DESC, [Address] DESC) AS RowNumber
    FROM Employee tbl
    WHERE EXISTS (SELECT TOP 1 1
                  FROM (SELECT FirstName, LastName, Address
                        FROM Employee
                        GROUP BY [FirstName], [LastName], [Address]
                        HAVING COUNT(*) > 1
                       ) GrpTable
                  WHERE GrpTable.FirstName = tbl.FirstName
                  AND GrpTable.LastName = tbl.LastName
                  AND GrpTable.Address = tbl.Address)
)
DELETE FROM CTE WHERE RowNumber > 1
GO

SELECT * FROM Employee
GO
--A more simplified and faster example
WITH CTE AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY [FirstName], [LastName], [Address]
                              ORDER BY [FirstName] DESC, [LastName] DESC, [Address] DESC) AS RowNumber,
           [FirstName], [LastName], [Address]
    FROM Employee tbl
)
DELETE FROM CTE WHERE RowNumber > 1
GO

SELECT * FROM Employee
GO
5. Using Fuzzy Group Transformation in SSIS
If you are using SSIS to upload data to your target table, you can use a Fuzzy Grouping Transformation before inserting records to the destination table, to ignore duplicate records and insert only unique records. Here, in the image below, you can see 9 records are coming from the source, but only 3 records are being inserted into the target table; that's because only 3 records out of the 9 are unique. Refer to Script #2 above to see more about these 9 records that were used.
In the Fuzzy Grouping Transformation editor, on the Columns tab, you specify the columns which you want to be included in grouping. As you can see in the image below, I have chosen all 3 columns for consideration in grouping.
After the Fuzzy Grouping Transformation, you might add a conditional split to direct unique rows and duplicate rows to two destinations. Here in the example you can see I am routing all the unique rows to the destination table and ignoring the duplicate records. The Fuzzy Grouping Transformation produces a few additional columns, such as _key_in, which uniquely identifies each row, and _key_out, which identifies a group of duplicate records.
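One common way to write that conditional split condition (an assumption on my part; the article does not show the expression) is to compare the two generated columns, since Fuzzy Grouping sets _key_out equal to _key_in on the one canonical row of each group:

```
_key_in == _key_out
```

Rows satisfying the condition are the unique/canonical rows and flow to the destination table; all other rows are duplicates and can be discarded or routed elsewhere for auditing.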