Philosophy of Programming: How to Remove Duplicate Records using Windowing Functions

"Making duplicate copies and computer printouts of things no one wanted even one of in the first place is giving America a new sense of purpose. "
-- Andy Rooney

"Or the Department of Education and another ministry were worried about duplication of effort, so what did they do? They set up two committees to look into duplication and neither knew what the other was up to. It really is a world beyond parody."
-- Rory Bremner

It happens every now and then, you create a large data set for some type of testing and since you are trying to "go faster" you do not bother with automating the process and some how one of the data set insertions gets ran twice. Now you have duplicate data in your data set and need to get rid of it.

Say you have the following table.

CREATE TABLE dbo.PurchaseOrderDetail
(
PurchaseOrderID int NOT NULL,
LineNumber smallint NOT NULL,
ProductID int NULL,
UnitPrice money NULL,
OrderQty smallint NULL,
rowguid uniqueidentifier ROWGUIDCOL NOT NULL
CONSTRAINT DF_PurchaseOrderDetail_rowguid DEFAULT (newid())
);

And the following INSERT statement is ran.

INSERT INTO dbo.PurchaseOrderDetail
(
PurchaseOrderID,
LineNumber,
ProductID,
UnitPrice,
OrderQty
)
VALUES
(100, 1, 1, 12.99, 1)
,(100, 2, 3, 15.99, 1)
,(100, 3, 6, 10.99, 2)
,(101, 2, 1, 12.99, 2)
,(101, 3, 2, 11.99, 1)
,(101, 4, 6, 10.99, 5)
,(102, 1, 1, 12.99, 8)
,(103, 1, 4, 19.99, 1);

Giving you the following data set.

SELECT * FROM dbo.PurchaseOrderDetail
;

PURCHASEORDERID	LINENUMBER	PRODUCTID	UNITPRICE	ORDERQTY	ROWGUID
100	1	1	12.99	1	A89225C8-3267-491C-B4E1-8688B902BDD3
100	2	3	15.99	1	528F7FC0-5870-47C7-ACD3-3955424DD742
100	3	6	10.99	2	498A1A36-8698-4374-9861-49DCF6514CF5
101	2	1	12.99	2	59820CAF-9209-4EAD-BDF0-D36170084E3C
101	3	2	11.99	1	A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30
101	4	6	10.99	5	B4021C5F-0FB3-4330-B1FC-AD7C3DE6A936
102	1	1	12.99	8	A24CFF12-F402-488C-9B26-8D2AF0710F36
103	1	4	19.99	1	3044E8AC-027A-4B13-9791-40768F2A4F3E

Now say some bozo runs the insert statement again and you now have the following data set.

PURCHASEORDERID	LINENUMBER	PRODUCTID	UNITPRICE	ORDERQTY	ROWGUID
100	1	1	12.99	1	A89225C8-3267-491C-B4E1-8688B902BDD3
100	2	3	15.99	1	528F7FC0-5870-47C7-ACD3-3955424DD742
100	3	6	10.99	2	498A1A36-8698-4374-9861-49DCF6514CF5
101	2	1	12.99	2	59820CAF-9209-4EAD-BDF0-D36170084E3C
101	3	2	11.99	1	A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30
101	4	6	10.99	5	B4021C5F-0FB3-4330-B1FC-AD7C3DE6A936
102	1	1	12.99	8	A24CFF12-F402-488C-9B26-8D2AF0710F36
103	1	4	19.99	1	3044E8AC-027A-4B13-9791-40768F2A4F3E
100	1	1	12.99	1	6D0193CD-198F-4A39-991C-9C1CBED4B7A5
100	2	3	15.99	1	5A007E16-76B3-425A-963E-CD1EB3A8A3D4
100	3	6	10.99	2	C01CCEBC-7FDB-4166-8780-6172015949AD
101	2	1	12.99	2	24816420-629B-42D4-94EE-D05D922EF553
101	3	2	11.99	1	8605E6A8-9CF4-4575-B8F4-F68E91D793C5
101	4	6	10.99	5	BD54EB82-4A20-484D-A5CF-53DCB9A27CB2
102	1	1	12.99	8	03DE0A36-6295-46E6-8F4E-C67E36EC2FAE
103	1	4	19.99	1	AE017968-1480-49E6-A9B7-510251C8E33A

The duplicate records have been highlighted. Since the GUID is created a new every time it is ran, the GUID values are different in the duplicate records, but the all the other data is exactly the same.

Say that it takes a rather long time to create the and load the test data, so a TRUNCATE or DROP and recreate are out of the question. What can we do to remove the duplicate data leaving the original data intact?

What we need is to be able to assign an identifier to each row that would be the antonym for the duplicate rows. ROW_NUMBER with a PARTITION BY natural keys (in this example PurchaseOrderID and LineNumber) would server just this case (since we do not care about the ORDER we can use the SELECT NULL trick). One of the records would have the row number of 1 and the duplicate would have the row number of 2, then we could simply DELETE all the records with the row number of 2.

For identifying the records we can create the following common table expression.

WITH cte AS (
SELECT
PurchaseOrderID
,LineNumber
,ROW_NUMBER()
OVER (PARTITION BY PurchaseOrderID, LineNumber ORDER BY (SELECT NULL)) AS num
FROM dbo.PurchaseOrderDetail
)

Which would give use the following (using SELECT * FROM cte)

PURCHASEORDERID	LINENUMBER	NUM
100	1	1
100	1	2
100	2	1
100	2	2
100	3	1
100	3	2
101	2	1
101	2	2
101	3	1
101	3	2
101	4	1
101	4	2
102	1	1
102	1	2
103	1	1
103	1	2

Now we have the duplicate records tagged with a row number of 2, so we can just DELETE those records.

WITH cte AS (
SELECT
PurchaseOrderID
,LineNumber
,ROW_NUMBER()
OVER (PARTITION BY PurchaseOrderID, LineNumber
ORDER BY (SELECT NULL)) AS num
FROM dbo.PurchaseOrderDetail
)
DELETE cte
WHERE num > 1
;

Bingo! We now have a clean data set once again.

SELECT * FROM dbo.PurchaseOrderDetail
;

PURCHASEORDERID	LINENUMBER	PRODUCTID	UNITPRICE	ORDERQTY	ROWGUID
100	1	1	12.99	1	A89225C8-3267-491C-B4E1-8688B902BDD3
100	3	6	10.99	2	498A1A36-8698-4374-9861-49DCF6514CF5
101	3	2	11.99	1	A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30
102	1	1	12.99	8	A24CFF12-F402-488C-9B26-8D2AF0710F36
100	2	3	15.99	1	5A007E16-76B3-425A-963E-CD1EB3A8A3D4
101	2	1	12.99	2	24816420-629B-42D4-94EE-D05D922EF553
101	4	6	10.99	5	BD54EB82-4A20-484D-A5CF-53DCB9A27CB2
103	1	4	19.99	1	AE017968-1480-49E6-A9B7-510251C8E33A

To design and test out this example I used SQL Fiddle using SQL Server 2012.

You can check out all the code at the following URL: http://sqlfiddle.com/#!6/b2f21/8/5

Philosophy of Programming

About Me

Monday, February 18, 2013

How to Remove Duplicate Records using Windowing Functions

Search This Blog

PURCHASEORDERID	LINENUMBER	NUM
100	1	1
100	1	2
100	2	1
100	2	2
100	3	1
100	3	2
101	2	1
101	2	2
101	3	1
101	3	2
101	4	1
101	4	2
102	1	1
102	1	2
103	1	1
103	1	2

PURCHASEORDERID	LINENUMBER	NUM
100	1	1
100	1	2
100	2	1
100	2	2
100	3	1
100	3	2
101	2	1
101	2	2
101	3	1
101	3	2
101	4	1
101	4	2
102	1	1
102	1	2
103	1	1
103	1	2

PURCHASEORDERID	LINENUMBER	NUM
100	1	1
100	1	2
100	2	1
100	2	2
100	3	1
100	3	2
101	2	1
101	2	2
101	3	1
101	3	2
101	4	1
101	4	2
102	1	1
102	1	2
103	1	1
103	1	2