Saturday, February 23, 2013

Upserting in SQL Server

"There is nothing more uncommon than common sense."
--- Frank Lloyd Wright

"Less is more."
-- Ludwig Mies van der Rohe

It happens: you're on a project and some architect comes up with a "great" idea.  Something along the lines of, "Instead of deleting data we'll do a 'soft' delete; that way, if we ever need to restore the data, we won't have to go to the backed-up source data, the data warehouse, or one of the many other copies we have."  There you are, stuck faking the deletion of data, adding unnecessary complexity to your system.



One of the "great" things about a "soft" delete is that now you need to check whether the record already exists as a "soft" deleted record before you can insert any new data.  Great.  So how do you do this?

The wrong way to do it would be a SELECT followed by either an INSERT or an UPDATE based on the results of the SELECT.  This causes two separate interactions with the database, and without a transaction it leaves a race window between the read and the write.
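For contrast, that two-trip approach looks something like the following sketch against the dbo.party table defined below (the @id, @first_name, and @last_name variables are illustrative inputs, not part of the original post):

```sql
-- Two round trips: one read, then one write based on the read.
IF EXISTS (SELECT 1 FROM dbo.party WHERE id = @id)
  UPDATE dbo.party
    SET first_name = @first_name,
        last_name  = @last_name,
        is_deleted = 'N'
    WHERE id = @id;
ELSE
  INSERT INTO dbo.party (id, first_name, last_name)
  VALUES (@id, @first_name, @last_name);
```

Between the EXISTS check and the write, another session can insert the same id, which is exactly the race window mentioned above.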

The better way to handle an upsert against "soft" deleted data is a single MERGE statement.

Say you have the following tables.


CREATE TABLE dbo.stage
(
  id [int] NOT NULL,
  first_name [varchar] (100) NOT NULL,
  last_name [varchar] (100) NOT NULL
);

CREATE TABLE dbo.party
(
  id [int] NOT NULL,
  first_name [varchar] (100) NOT NULL,
  last_name [varchar] (100) NOT NULL,
  is_deleted [char](1) DEFAULT('N') NOT NULL
);


With the following data.


INSERT INTO dbo.stage
(
  id
 ,first_name
 ,last_name
)
VALUES
  (1, 'Mike', 'Harris')
 ,(2, 'Kelsey', 'Harris')
 ,(3, 'Jack', 'Harris')
;

INSERT INTO dbo.party
(
  id
 ,first_name
 ,last_name
 ,is_deleted
)
VALUES
  (1, 'Mike', 'Harris', 'Y')
 ,(2, 'Kelsey', 'Hopkins', 'N')
 ,(4, 'Tom', 'Smith', 'N')
;


Staging

ID | FIRST_NAME | LAST_NAME
 1 | Mike       | Harris
 2 | Kelsey     | Harris
 3 | Jack       | Harris

Party with "soft" delete
IDFIRST_NAMELAST_NAMEIS_DELETED
1MikeHarrisY
2KelseyHopkinsN
4TomSmithN

Looking at the data above, we see that the record with ID value of 1 has been "soft" deleted.

Say we want to merge the staging data into the Party data.  Assuming we have a full data set, we can "delete" any of the records not found in the staging data.

In order to merge the staging data into the Party data with "soft" deletes we would use the following MERGE statement.


MERGE dbo.party AS TARGET
USING dbo.stage AS SOURCE
  ON TARGET.id = SOURCE.id
WHEN MATCHED
  AND (TARGET.first_name <> SOURCE.first_name
       OR TARGET.last_name <> SOURCE.last_name)
   OR TARGET.is_deleted = 'Y' THEN
  UPDATE
    SET TARGET.first_name = SOURCE.first_name
       ,TARGET.last_name = SOURCE.last_name
       ,TARGET.is_deleted = 'N'
WHEN NOT MATCHED BY TARGET THEN
  INSERT (id, first_name, last_name)
  VALUES (SOURCE.id, SOURCE.first_name, SOURCE.last_name)
WHEN NOT MATCHED BY SOURCE THEN
  UPDATE
    SET TARGET.is_deleted = 'Y'
;

Two clauses are what make this MERGE statement honor "soft" deletes.  On the WHEN MATCHED clause we include OR TARGET.is_deleted = 'Y' (or whatever kind of delete indicator you are using); this allows an UPDATE to take place when the record being inserted already exists as a "soft" deleted record.  The WHEN NOT MATCHED BY SOURCE THEN clause must do an UPDATE with SET TARGET.is_deleted = 'Y' (again, or whatever kind of delete indicator you are using) instead of the typical DELETE.
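One caveat: under concurrent load a MERGE can still hit race conditions at the default isolation level.  A commonly cited mitigation (a general MERGE pattern, not something specific to soft deletes) is to take a serializable range lock on the target with the HOLDLOCK hint:

```sql
-- Same MERGE as above, but with HOLDLOCK so concurrent sessions
-- cannot sneak a matching row in between the match check and the write.
MERGE dbo.party WITH (HOLDLOCK) AS TARGET
USING dbo.stage AS SOURCE
  ON TARGET.id = SOURCE.id
WHEN MATCHED
  AND (TARGET.first_name <> SOURCE.first_name
       OR TARGET.last_name <> SOURCE.last_name)
   OR TARGET.is_deleted = 'Y' THEN
  UPDATE
    SET TARGET.first_name = SOURCE.first_name
       ,TARGET.last_name = SOURCE.last_name
       ,TARGET.is_deleted = 'N'
WHEN NOT MATCHED BY TARGET THEN
  INSERT (id, first_name, last_name)
  VALUES (SOURCE.id, SOURCE.first_name, SOURCE.last_name)
WHEN NOT MATCHED BY SOURCE THEN
  UPDATE
    SET TARGET.is_deleted = 'Y'
;
```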

Party with "soft" delete
IDFIRST_NAMELAST_NAMEIS_DELETED
1MikeHarrisN
2KelseyHarrisN
4TomSmithY
3JackHarrisN

In the table above, Mike has been un-"soft" deleted, Kelsey's last name has changed to Harris, Tom has been "soft" deleted, and Jack has been inserted into the table.

Well, that is how one does a "soft" delete in SQL Server.

I used SQL Fiddle to test this out, the sample code can be found here.

Monday, February 18, 2013

How to Remove Duplicate Records using Windowing Functions

"Making duplicate copies and computer printouts of things no one wanted even one of in the first place is giving America a new sense of purpose. "
-- Andy Rooney

"Or the Department of Education and another ministry were worried about duplication of effort, so what did they do? They set up two committees to look into duplication and neither knew what the other was up to. It really is a world beyond parody."
-- Rory Bremner

It happens every now and then: you create a large data set for some type of testing, and since you are trying to "go faster" you do not bother with automating the process, and somehow one of the data set insertions gets run twice.  Now you have duplicate data in your data set and need to get rid of it.

Say you have the following table.

CREATE TABLE dbo.PurchaseOrderDetail
(
    PurchaseOrderID int NOT NULL,
    LineNumber smallint NOT NULL,
    ProductID int NULL,
    UnitPrice money NULL,
    OrderQty smallint NULL,
    rowguid uniqueidentifier ROWGUIDCOL  NOT NULL
        CONSTRAINT DF_PurchaseOrderDetail_rowguid DEFAULT (newid())
);

And the following INSERT statement is run.

INSERT INTO dbo.PurchaseOrderDetail
(
  PurchaseOrderID,
  LineNumber,
  ProductID,
  UnitPrice,
  OrderQty
)
VALUES
 (100, 1, 1, 12.99, 1)
,(100, 2, 3, 15.99, 1)
,(100, 3, 6, 10.99, 2)
,(101, 2, 1, 12.99, 2)
,(101, 3, 2, 11.99, 1)
,(101, 4, 6, 10.99, 5)
,(102, 1, 1, 12.99, 8)
,(103, 1, 4, 19.99, 1);

Giving you the following data set.

SELECT * FROM dbo.PurchaseOrderDetail
;

PURCHASEORDERID | LINENUMBER | PRODUCTID | UNITPRICE | ORDERQTY | ROWGUID
100             | 1          | 1         | 12.99     | 1        | A89225C8-3267-491C-B4E1-8688B902BDD3
100             | 2          | 3         | 15.99     | 1        | 528F7FC0-5870-47C7-ACD3-3955424DD742
100             | 3          | 6         | 10.99     | 2        | 498A1A36-8698-4374-9861-49DCF6514CF5
101             | 2          | 1         | 12.99     | 2        | 59820CAF-9209-4EAD-BDF0-D36170084E3C
101             | 3          | 2         | 11.99     | 1        | A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30
101             | 4          | 6         | 10.99     | 5        | B4021C5F-0FB3-4330-B1FC-AD7C3DE6A936
102             | 1          | 1         | 12.99     | 8        | A24CFF12-F402-488C-9B26-8D2AF0710F36
103             | 1          | 4         | 19.99     | 1        | 3044E8AC-027A-4B13-9791-40768F2A4F3E

Now say some bozo runs the INSERT statement again, and you have the following data set.

PURCHASEORDERID | LINENUMBER | PRODUCTID | UNITPRICE | ORDERQTY | ROWGUID
100             | 1          | 1         | 12.99     | 1        | A89225C8-3267-491C-B4E1-8688B902BDD3
100             | 2          | 3         | 15.99     | 1        | 528F7FC0-5870-47C7-ACD3-3955424DD742
100             | 3          | 6         | 10.99     | 2        | 498A1A36-8698-4374-9861-49DCF6514CF5
101             | 2          | 1         | 12.99     | 2        | 59820CAF-9209-4EAD-BDF0-D36170084E3C
101             | 3          | 2         | 11.99     | 1        | A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30
101             | 4          | 6         | 10.99     | 5        | B4021C5F-0FB3-4330-B1FC-AD7C3DE6A936
102             | 1          | 1         | 12.99     | 8        | A24CFF12-F402-488C-9B26-8D2AF0710F36
103             | 1          | 4         | 19.99     | 1        | 3044E8AC-027A-4B13-9791-40768F2A4F3E
100             | 1          | 1         | 12.99     | 1        | 6D0193CD-198F-4A39-991C-9C1CBED4B7A5
100             | 2          | 3         | 15.99     | 1        | 5A007E16-76B3-425A-963E-CD1EB3A8A3D4
100             | 3          | 6         | 10.99     | 2        | C01CCEBC-7FDB-4166-8780-6172015949AD
101             | 2          | 1         | 12.99     | 2        | 24816420-629B-42D4-94EE-D05D922EF553
101             | 3          | 2         | 11.99     | 1        | 8605E6A8-9CF4-4575-B8F4-F68E91D793C5
101             | 4          | 6         | 10.99     | 5        | BD54EB82-4A20-484D-A5CF-53DCB9A27CB2
102             | 1          | 1         | 12.99     | 8        | 03DE0A36-6295-46E6-8F4E-C67E36EC2FAE
103             | 1          | 4         | 19.99     | 1        | AE017968-1480-49E6-A9B7-510251C8E33A

The last eight rows are the duplicates.  Since the GUID is created anew every time the statement is run, the GUID values differ between the duplicate records, but all the other data is exactly the same.
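Before removing anything, a quick sanity check (a sketch added here, not part of the original post) is to count rows per natural key and list the keys that occur more than once:

```sql
-- Natural keys that appear more than once are duplicates.
SELECT PurchaseOrderID, LineNumber, COUNT(*) AS cnt
FROM dbo.PurchaseOrderDetail
GROUP BY PurchaseOrderID, LineNumber
HAVING COUNT(*) > 1;
```

With the data above, this should return all eight natural keys, each with a count of 2.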

Say it takes a rather long time to create and load the test data, so a TRUNCATE or a DROP and recreate is out of the question.  What can we do to remove the duplicate data while leaving the original data intact?

What we need is a way to assign an identifier to each row that distinguishes the duplicates from the originals.  ROW_NUMBER with a PARTITION BY on the natural key (in this example PurchaseOrderID and LineNumber) serves just this case (since we do not care about the order, we can use the SELECT NULL trick).  One of the records would get the row number 1 and its duplicate the row number 2; then we can simply DELETE all the records with a row number greater than 1.

For identifying the records we can create the following common table expression.

WITH cte AS (
  SELECT
    PurchaseOrderID
   ,LineNumber
   ,ROW_NUMBER()
      OVER (PARTITION BY PurchaseOrderID, LineNumber ORDER BY (SELECT NULL)) AS num
  FROM dbo.PurchaseOrderDetail
)

Which would give us the following (using SELECT * FROM cte).

PURCHASEORDERID | LINENUMBER | NUM
100             | 1          | 1
100             | 1          | 2
100             | 2          | 1
100             | 2          | 2
100             | 3          | 1
100             | 3          | 2
101             | 2          | 1
101             | 2          | 2
101             | 3          | 1
101             | 3          | 2
101             | 4          | 1
101             | 4          | 2
102             | 1          | 1
102             | 1          | 2
103             | 1          | 1
103             | 1          | 2

Now we have the duplicate records tagged with a row number of 2, so we can just DELETE those records.

WITH cte AS (
  SELECT
    PurchaseOrderID
   ,LineNumber
   ,ROW_NUMBER()
      OVER (PARTITION BY PurchaseOrderID, LineNumber
                   ORDER BY (SELECT NULL)) AS num
  FROM dbo.PurchaseOrderDetail
)
DELETE cte
  WHERE num > 1
;

Bingo!  We now have a clean data set once again.

SELECT * FROM dbo.PurchaseOrderDetail
;

PURCHASEORDERID | LINENUMBER | PRODUCTID | UNITPRICE | ORDERQTY | ROWGUID
100             | 1          | 1         | 12.99     | 1        | A89225C8-3267-491C-B4E1-8688B902BDD3
100             | 3          | 6         | 10.99     | 2        | 498A1A36-8698-4374-9861-49DCF6514CF5
101             | 3          | 2         | 11.99     | 1        | A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30
102             | 1          | 1         | 12.99     | 8        | A24CFF12-F402-488C-9B26-8D2AF0710F36
100             | 2          | 3         | 15.99     | 1        | 5A007E16-76B3-425A-963E-CD1EB3A8A3D4
101             | 2          | 1         | 12.99     | 2        | 24816420-629B-42D4-94EE-D05D922EF553
101             | 4          | 6         | 10.99     | 5        | BD54EB82-4A20-484D-A5CF-53DCB9A27CB2
103             | 1          | 4         | 19.99     | 1        | AE017968-1480-49E6-A9B7-510251C8E33A

To design and test this example I used SQL Fiddle running SQL Server 2012.

You can check out all the code at the following URL: http://sqlfiddle.com/#!6/b2f21/8/5

Sunday, February 10, 2013

The Case of Not Having a Unique Identity

"Always remember that you're unique. Just like everyone else."
-- Alison Boulter

Every now and then it happens: you get a data set that has been corrupted and does not have a unique identity column.  What to do?  Well, you can copy the data into a temp table that has an identity column, truncate the original table, and then copy the data from the temp table back into the original table.  This works, but it takes a long time on a large data set, and it assumes you are allowed to TRUNCATE (otherwise you would have to DELETE and fill up the transaction log).
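For reference, that slow path might be sketched like this against the dbo.PurchaseOrderDetail table defined below (the #fixup temp table name is illustrative):

```sql
-- 1. Copy into a temp table, generating a fresh identity as we go.
SELECT IDENTITY(int, 1, 1) AS new_id,
       LineNumber, ProductID, UnitPrice, OrderQty
INTO #fixup
FROM dbo.PurchaseOrderDetail;

-- 2. Empty the original table.
TRUNCATE TABLE dbo.PurchaseOrderDetail;

-- 3. Copy back, using the generated values as the identity.
INSERT INTO dbo.PurchaseOrderDetail
  (PurchaseOrderID, LineNumber, ProductID, UnitPrice, OrderQty)
SELECT new_id, LineNumber, ProductID, UnitPrice, OrderQty
FROM #fixup;

DROP TABLE #fixup;
```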

Your other option is to come up with a statement that assigns a unique identity to each row and then use it to update the table.  Luckily, the windowing function ROW_NUMBER can be used for just this task.

Say you have the following table:


CREATE TABLE dbo.PurchaseOrderDetail
(
    PurchaseOrderID int NOT NULL,
    LineNumber smallint NOT NULL,
    ProductID int NULL,
    UnitPrice money NULL,
    OrderQty smallint NULL
);


And the following data:


INSERT INTO dbo.PurchaseOrderDetail
(
    PurchaseOrderID,
    LineNumber,
    ProductID,
    UnitPrice,
    OrderQty
)
VALUES (1, 110, 201, 10.99, 3),
       (1, 112, 2, 5.98, 1),
       (1, 113, 21, 11.99, 2),
       (1, 220, 1, 1.99, 5),
       (1, 210, 1, 1.99, 6),
       (1, 410, 201, 10.99, 7),
       (1, 50, 201, 10.99, 31),
       (1, 500, 301, 99.99, 1),
       (1, 116, 21, 11.99, 1),
       (1, 100, 2, 5.98, 1),
       (1, 10, 201, 10.99, 1);


As you can see there is an issue: the PurchaseOrderID has been corrupted.  Its value is always 1, but it should be a unique identity.  Let's use the ROW_NUMBER function in a CTE to give us a unique identity for each row.


WITH cte
AS
(
  SELECT
     PurchaseOrderID
    ,ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS num
    FROM dbo.PurchaseOrderDetail
)


We use the little trick of ORDERing BY (SELECT NULL) since we do not care about the order in which the row numbers are assigned, but ROW_NUMBER requires an ORDER BY.

Now all we need to do is use this CTE in an UPDATE to set the PurchaseOrderID to the ROW_NUMBER value.


WITH cte
AS
(
  SELECT
     PurchaseOrderID
    ,ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS num
    FROM dbo.PurchaseOrderDetail
)
UPDATE cte
  SET PurchaseOrderID = num;


With this one simple statement we have assigned a unique identity to every PurchaseOrderID.  Now quickly ALTER the table and make it a primary key, so this does not happen again!
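That last step might look like the following (the constraint name is illustrative):

```sql
-- A primary key will now reject any future duplicate PurchaseOrderID.
ALTER TABLE dbo.PurchaseOrderDetail
  ADD CONSTRAINT PK_PurchaseOrderDetail
  PRIMARY KEY (PurchaseOrderID);
```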

This whole solution can be found on SQL Fiddle here.