-- Andy Rooney
"Or the Department of Education and another ministry were worried about duplication of effort, so what did they do? They set up two committees to look into duplication and neither knew what the other was up to. It really is a world beyond parody."
-- Rory Bremner
It happens every now and then, you create a large data set for some type of testing and since you are trying to "go faster" you do not bother with automating the process and some how one of the data set insertions gets ran twice. Now you have duplicate data in your data set and need to get rid of it.
Say you have the following table.
CREATE TABLE dbo.PurchaseOrderDetail
(
PurchaseOrderID int NOT NULL,
LineNumber smallint NOT NULL,
ProductID int NULL,
UnitPrice money NULL,
OrderQty smallint NULL,
rowguid uniqueidentifier ROWGUIDCOL NOT NULL
CONSTRAINT DF_PurchaseOrderDetail_rowguid DEFAULT (newid())
);
And the following INSERT statement is ran.
INSERT INTO dbo.PurchaseOrderDetail
(
PurchaseOrderID,
LineNumber,
ProductID,
UnitPrice,
OrderQty
)
VALUES
(100, 1, 1, 12.99, 1)
,(100, 2, 3, 15.99, 1)
,(100, 3, 6, 10.99, 2)
,(101, 2, 1, 12.99, 2)
,(101, 3, 2, 11.99, 1)
,(101, 4, 6, 10.99, 5)
,(102, 1, 1, 12.99, 8)
,(103, 1, 4, 19.99, 1);
Giving you the following data set.
SELECT * FROM dbo.PurchaseOrderDetail
;
PURCHASEORDERID | LINENUMBER | PRODUCTID | UNITPRICE | ORDERQTY | ROWGUID |
---|---|---|---|---|---|
100 | 1 | 1 | 12.99 | 1 | A89225C8-3267-491C-B4E1-8688B902BDD3 |
100 | 2 | 3 | 15.99 | 1 | 528F7FC0-5870-47C7-ACD3-3955424DD742 |
100 | 3 | 6 | 10.99 | 2 | 498A1A36-8698-4374-9861-49DCF6514CF5 |
101 | 2 | 1 | 12.99 | 2 | 59820CAF-9209-4EAD-BDF0-D36170084E3C |
101 | 3 | 2 | 11.99 | 1 | A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30 |
101 | 4 | 6 | 10.99 | 5 | B4021C5F-0FB3-4330-B1FC-AD7C3DE6A936 |
102 | 1 | 1 | 12.99 | 8 | A24CFF12-F402-488C-9B26-8D2AF0710F36 |
103 | 1 | 4 | 19.99 | 1 | 3044E8AC-027A-4B13-9791-40768F2A4F3E |
PURCHASEORDERID | LINENUMBER | PRODUCTID | UNITPRICE | ORDERQTY | ROWGUID |
---|---|---|---|---|---|
100 | 1 | 1 | 12.99 | 1 | A89225C8-3267-491C-B4E1-8688B902BDD3 |
100 | 2 | 3 | 15.99 | 1 | 528F7FC0-5870-47C7-ACD3-3955424DD742 |
100 | 3 | 6 | 10.99 | 2 | 498A1A36-8698-4374-9861-49DCF6514CF5 |
101 | 2 | 1 | 12.99 | 2 | 59820CAF-9209-4EAD-BDF0-D36170084E3C |
101 | 3 | 2 | 11.99 | 1 | A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30 |
101 | 4 | 6 | 10.99 | 5 | B4021C5F-0FB3-4330-B1FC-AD7C3DE6A936 |
102 | 1 | 1 | 12.99 | 8 | A24CFF12-F402-488C-9B26-8D2AF0710F36 |
103 | 1 | 4 | 19.99 | 1 | 3044E8AC-027A-4B13-9791-40768F2A4F3E |
100 | 1 | 1 | 12.99 | 1 | 6D0193CD-198F-4A39-991C-9C1CBED4B7A5 |
100 | 2 | 3 | 15.99 | 1 | 5A007E16-76B3-425A-963E-CD1EB3A8A3D4 |
100 | 3 | 6 | 10.99 | 2 | C01CCEBC-7FDB-4166-8780-6172015949AD |
101 | 2 | 1 | 12.99 | 2 | 24816420-629B-42D4-94EE-D05D922EF553 |
101 | 3 | 2 | 11.99 | 1 | 8605E6A8-9CF4-4575-B8F4-F68E91D793C5 |
101 | 4 | 6 | 10.99 | 5 | BD54EB82-4A20-484D-A5CF-53DCB9A27CB2 |
102 | 1 | 1 | 12.99 | 8 | 03DE0A36-6295-46E6-8F4E-C67E36EC2FAE |
103 | 1 | 4 | 19.99 | 1 | AE017968-1480-49E6-A9B7-510251C8E33A |
Say that it takes a rather long time to create the and load the test data, so a TRUNCATE or DROP and recreate are out of the question. What can we do to remove the duplicate data leaving the original data intact?
What we need is to be able to assign an identifier to each row that would be the antonym for the duplicate rows. ROW_NUMBER with a PARTITION BY natural keys (in this example PurchaseOrderID and LineNumber) would server just this case (since we do not care about the ORDER we can use the SELECT NULL trick). One of the records would have the row number of 1 and the duplicate would have the row number of 2, then we could simply DELETE all the records with the row number of 2.
For identifying the records we can create the following common table expression.
WITH cte AS (
SELECT
PurchaseOrderID
,LineNumber
,ROW_NUMBER()
OVER (PARTITION BY PurchaseOrderID, LineNumber ORDER BY (SELECT NULL)) AS num
FROM dbo.PurchaseOrderDetail
)
Which would give use the following (using SELECT * FROM cte)
PURCHASEORDERID | LINENUMBER | NUM |
---|---|---|
100 | 1 | 1 |
100 | 1 | 2 |
100 | 2 | 1 |
100 | 2 | 2 |
100 | 3 | 1 |
100 | 3 | 2 |
101 | 2 | 1 |
101 | 2 | 2 |
101 | 3 | 1 |
101 | 3 | 2 |
101 | 4 | 1 |
101 | 4 | 2 |
102 | 1 | 1 |
102 | 1 | 2 |
103 | 1 | 1 |
103 | 1 | 2 |
WITH cte AS (
SELECT
PurchaseOrderID
,LineNumber
,ROW_NUMBER()
OVER (PARTITION BY PurchaseOrderID, LineNumber
ORDER BY (SELECT NULL)) AS num
FROM dbo.PurchaseOrderDetail
)
DELETE cte
WHERE num > 1
;
Bingo! We now have a clean data set once again.
SELECT * FROM dbo.PurchaseOrderDetail
;
PURCHASEORDERID | LINENUMBER | PRODUCTID | UNITPRICE | ORDERQTY | ROWGUID |
---|---|---|---|---|---|
100 | 1 | 1 | 12.99 | 1 | A89225C8-3267-491C-B4E1-8688B902BDD3 |
100 | 3 | 6 | 10.99 | 2 | 498A1A36-8698-4374-9861-49DCF6514CF5 |
101 | 3 | 2 | 11.99 | 1 | A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30 |
102 | 1 | 1 | 12.99 | 8 | A24CFF12-F402-488C-9B26-8D2AF0710F36 |
100 | 2 | 3 | 15.99 | 1 | 5A007E16-76B3-425A-963E-CD1EB3A8A3D4 |
101 | 2 | 1 | 12.99 | 2 | 24816420-629B-42D4-94EE-D05D922EF553 |
101 | 4 | 6 | 10.99 | 5 | BD54EB82-4A20-484D-A5CF-53DCB9A27CB2 |
103 | 1 | 4 | 19.99 | 1 | AE017968-1480-49E6-A9B7-510251C8E33A |
You can check out all the code at the following URL: http://sqlfiddle.com/#!6/b2f21/8/5