Monday, February 18, 2013

How to Remove Duplicate Records using Windowing Functions

"Making duplicate copies and computer printouts of things no one wanted even one of in the first place is giving America a new sense of purpose. "
-- Andy Rooney

"Or the Department of Education and another ministry were worried about duplication of effort, so what did they do? They set up two committees to look into duplication and neither knew what the other was up to. It really is a world beyond parody."
-- Rory Bremner

It happens every now and then: you create a large data set for some type of testing, and since you are trying to "go faster" you do not bother automating the process, and somehow one of the data set insertions gets run twice.  Now you have duplicate data in your data set and need to get rid of it.

Say you have the following table.

CREATE TABLE dbo.PurchaseOrderDetail
(
    PurchaseOrderID int NOT NULL,
    LineNumber smallint NOT NULL,
    ProductID int NULL,
    UnitPrice money NULL,
    OrderQty smallint NULL,
    rowguid uniqueidentifier ROWGUIDCOL  NOT NULL
        CONSTRAINT DF_PurchaseOrderDetail_rowguid DEFAULT (newid())
);

And the following INSERT statement is run.

INSERT INTO dbo.PurchaseOrderDetail
(
  PurchaseOrderID,
  LineNumber,
  ProductID,
  UnitPrice,
  OrderQty
)
VALUES
 (100, 1, 1, 12.99, 1)
,(100, 2, 3, 15.99, 1)
,(100, 3, 6, 10.99, 2)
,(101, 2, 1, 12.99, 2)
,(101, 3, 2, 11.99, 1)
,(101, 4, 6, 10.99, 5)
,(102, 1, 1, 12.99, 8)
,(103, 1, 4, 19.99, 1);

This gives you the following data set.

SELECT * FROM dbo.PurchaseOrderDetail
;

PURCHASEORDERID  LINENUMBER  PRODUCTID  UNITPRICE  ORDERQTY  ROWGUID
---------------  ----------  ---------  ---------  --------  ------------------------------------
100              1           1          12.99      1         A89225C8-3267-491C-B4E1-8688B902BDD3
100              2           3          15.99      1         528F7FC0-5870-47C7-ACD3-3955424DD742
100              3           6          10.99      2         498A1A36-8698-4374-9861-49DCF6514CF5
101              2           1          12.99      2         59820CAF-9209-4EAD-BDF0-D36170084E3C
101              3           2          11.99      1         A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30
101              4           6          10.99      5         B4021C5F-0FB3-4330-B1FC-AD7C3DE6A936
102              1           1          12.99      8         A24CFF12-F402-488C-9B26-8D2AF0710F36
103              1           4          19.99      1         3044E8AC-027A-4B13-9791-40768F2A4F3E

Now say some bozo runs the insert statement again and you now have the following data set.

PURCHASEORDERID  LINENUMBER  PRODUCTID  UNITPRICE  ORDERQTY  ROWGUID
---------------  ----------  ---------  ---------  --------  ------------------------------------
100              1           1          12.99      1         A89225C8-3267-491C-B4E1-8688B902BDD3
100              2           3          15.99      1         528F7FC0-5870-47C7-ACD3-3955424DD742
100              3           6          10.99      2         498A1A36-8698-4374-9861-49DCF6514CF5
101              2           1          12.99      2         59820CAF-9209-4EAD-BDF0-D36170084E3C
101              3           2          11.99      1         A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30
101              4           6          10.99      5         B4021C5F-0FB3-4330-B1FC-AD7C3DE6A936
102              1           1          12.99      8         A24CFF12-F402-488C-9B26-8D2AF0710F36
103              1           4          19.99      1         3044E8AC-027A-4B13-9791-40768F2A4F3E
100              1           1          12.99      1         6D0193CD-198F-4A39-991C-9C1CBED4B7A5
100              2           3          15.99      1         5A007E16-76B3-425A-963E-CD1EB3A8A3D4
100              3           6          10.99      2         C01CCEBC-7FDB-4166-8780-6172015949AD
101              2           1          12.99      2         24816420-629B-42D4-94EE-D05D922EF553
101              3           2          11.99      1         8605E6A8-9CF4-4575-B8F4-F68E91D793C5
101              4           6          10.99      5         BD54EB82-4A20-484D-A5CF-53DCB9A27CB2
102              1           1          12.99      8         03DE0A36-6295-46E6-8F4E-C67E36EC2FAE
103              1           4          19.99      1         AE017968-1480-49E6-A9B7-510251C8E33A

The last eight rows are the duplicates.  Since the GUID is created anew every time the statement is run, the GUID values are different in the duplicate records, but all the other data is exactly the same.
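
Before deleting anything, it can help to confirm which keys are actually duplicated. The query below is a minimal sketch, assuming that PurchaseOrderID and LineNumber together identify a row (the natural key used later in this post). The column alias copies is just an illustrative name.

```sql
-- List every natural key that appears more than once.
-- Assumes (PurchaseOrderID, LineNumber) is the natural key of the table.
SELECT
    PurchaseOrderID
   ,LineNumber
   ,COUNT(*) AS copies   -- illustrative alias, not part of the original example
FROM dbo.PurchaseOrderDetail
GROUP BY PurchaseOrderID, LineNumber
HAVING COUNT(*) > 1
ORDER BY PurchaseOrderID, LineNumber
;
```

Against the doubled data set above, this should list all eight keys, each with a count of 2.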

Say that it takes a rather long time to create and load the test data, so a TRUNCATE or a DROP and recreate is out of the question.  What can we do to remove the duplicate data while leaving the original data intact?

What we need is an identifier for each row that tells the duplicate rows apart.  ROW_NUMBER with a PARTITION BY on the natural keys (in this example PurchaseOrderID and LineNumber) serves just this case.  Since we do not care about the order within each partition, we can use the ORDER BY (SELECT NULL) trick to satisfy the ORDER BY clause that ROW_NUMBER requires.  One of the records would have the row number of 1 and the duplicate would have the row number of 2, so we could simply DELETE all the records with the row number of 2.

For identifying the records we can create the following common table expression.

WITH cte AS (
  SELECT
    PurchaseOrderID
   ,LineNumber
   ,ROW_NUMBER()
      OVER (PARTITION BY PurchaseOrderID, LineNumber
                   ORDER BY (SELECT NULL)) AS num
  FROM dbo.PurchaseOrderDetail
)
SELECT * FROM cte
;

Which would give us the following.

PURCHASEORDERID  LINENUMBER  NUM
---------------  ----------  ---
100              1           1
100              1           2
100              2           1
100              2           2
100              3           1
100              3           2
101              2           1
101              2           2
101              3           1
101              3           2
101              4           1
101              4           2
102              1           1
102              1           2
103              1           1
103              1           2

Now we have the duplicate records tagged with a row number of 2, so we can just DELETE those records.

WITH cte AS (
  SELECT
    PurchaseOrderID
   ,LineNumber
   ,ROW_NUMBER()
      OVER (PARTITION BY PurchaseOrderID, LineNumber
                   ORDER BY (SELECT NULL)) AS num
  FROM dbo.PurchaseOrderDetail
)
DELETE cte
  WHERE num > 1
;
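
When the table holds data you cannot easily rebuild, one cautious variation is to run the same DELETE inside an explicit transaction and check the affected row count before committing. This is a sketch rather than part of the original example. The expected count of 8 is an assumption that matches the sample data in this post and would need adjusting for any other data set.

```sql
BEGIN TRANSACTION;

WITH cte AS (
  SELECT
    PurchaseOrderID
   ,LineNumber
   ,ROW_NUMBER()
      OVER (PARTITION BY PurchaseOrderID, LineNumber
                   ORDER BY (SELECT NULL)) AS num
  FROM dbo.PurchaseOrderDetail
)
DELETE cte
  WHERE num > 1
;

-- @@ROWCOUNT holds the number of rows affected by the previous statement,
-- so it must be read immediately after the DELETE.
-- 8 is the number of duplicates expected for this sample data (an assumption).
IF @@ROWCOUNT = 8
BEGIN
    COMMIT TRANSACTION;
END
ELSE
BEGIN
    ROLLBACK TRANSACTION;
END
```

If the count is anything other than expected, the ROLLBACK leaves the table exactly as it was, duplicates included, so the partition keys can be rechecked.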

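Bingo!  We now have a clean data set once again.  Because the ORDER BY (SELECT NULL) ordering is arbitrary, the copy that survives in each pair may come from either insert, as the rowguid values below show.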

SELECT * FROM dbo.PurchaseOrderDetail
;

PURCHASEORDERID  LINENUMBER  PRODUCTID  UNITPRICE  ORDERQTY  ROWGUID
---------------  ----------  ---------  ---------  --------  ------------------------------------
100              1           1          12.99      1         A89225C8-3267-491C-B4E1-8688B902BDD3
100              3           6          10.99      2         498A1A36-8698-4374-9861-49DCF6514CF5
101              3           2          11.99      1         A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30
102              1           1          12.99      8         A24CFF12-F402-488C-9B26-8D2AF0710F36
100              2           3          15.99      1         5A007E16-76B3-425A-963E-CD1EB3A8A3D4
101              2           1          12.99      2         24816420-629B-42D4-94EE-D05D922EF553
101              4           6          10.99      5         BD54EB82-4A20-484D-A5CF-53DCB9A27CB2
103              1           4          19.99      1         AE017968-1480-49E6-A9B7-510251C8E33A

To design and test out this example I used SQL Fiddle using SQL Server 2012.

You can check out all the code at the following URL: http://sqlfiddle.com/#!6/b2f21/8/5