Philosophy of Programming: CTE

Showing posts with label CTE. Show all posts

Sunday, December 8, 2013

FizzBuzz - Now in SQL Server

"Yet again? What do you here? Shall we give o'er and
drown? Have you a mind to sink?"
-- Shakespeare, The Tempest
Act I, Scene I, Lines 38-39

The great thing about the FizzBuzz kata is that it is very easy to understand, but has nearly an unlimited number of ways to implement it. I am fan of using the right tool for the job, but I do enjoy stretching things to see what they can and cannot do with the hope of gaining a different point of view. Having said that let us take a look at implementing the FizzBuzz kata in SQL using SQL Server 2012.

Overview of FizzBuzz

The goal of the FizzBuzz kata is to create a list of values which move through a filter with the following rules:

If the number given is divisible by 3 then produce "Fizz"
If the number given is divisible by 5 then produce "Buzz"
If the number is not divisible by either 3 or 5 then produce the number given

FizzBuzz has one edge case:

If the number given is divisible by both 3 and 5 then produce "FizzBuzz"

An example would be the following:

We see that:
2 => 2
4 => 4
3 => Fizz
9 => Fizz
5 => Buzz
25 => Buzz
15 => FizzBuzz
45 => FizzBuzz

How to Generate the Dataset in SQL Server

Given that we are going to use SQL Server, the first question is how do we generate a dataset to run through our FizzBuzz? We could create a table with all of the values that we need, but that is not very interesting. Instead we will use a Common Table Expression (CTE) to generate our data.

We'll use a Tally Table as outline by Vinay Pugalia's blog post.

DECLARE
@min AS INT = 1
,@max AS INT = 20

;WITH [range] AS (
SELECT @min AS num
UNION ALL
SELECT num + 1 FROM [range] WHERE num < @max
)
SELECT *
FROM [range]
;

This will generate the number 1 to 20 in a dataset.

http://sqlfiddle.com/#!6/9cd0c/7

We can generate any dataset starting at @min up to @max using this CTE, not bad.

Setting up the Business Logic

With our Common Table Expression set up to generate whatever range of numbers we want, we can now focus on the Business Logic of the problem. Since we are looking for particular values to produce a result in some cases it sounds like the perfect job for the CASE statement.

DECLARE
@min AS INT = 1
,@max AS INT = 20

;WITH [range] AS (
SELECT @min AS num
UNION ALL
SELECT num + 1 FROM [range] WHERE num < @max
)
SELECT
CASE
WHEN num % 15 = 0 THEN 'FizzBuzz'
WHEN num % 3 = 0 THEN 'Fizz'
WHEN num % 5 = 0 THEN 'Buzz'
ELSE CAST(num AS VARCHAR)
END AS value
FROM [range]
;

Gives use the output we would want:

http://sqlfiddle.com/#!6/9cd0c/12

This gets us what we want, but personally I do not like repeating the logic for divisible by 3 and 5 in the WHEN num % 15 = 0 case.

How about if we join to a set for the Fizz and another set for the Buzz? If we could do that we could use the CONCAT string operation to cover the divisible by 15 edge case.

DECLARE
@min AS INT = 1
,@max AS INT = 20

;WITH [range] AS (
SELECT @min AS num
UNION ALL
SELECT num + 1 FROM [range] WHERE num < @max
)
SELECT
CASE
WHEN CONCAT(fizz.value, buzz.value) <> ''
THEN CONCAT(fizz.value, buzz.value)
ELSE CAST(num AS VARCHAR)
END AS value
FROM [range]
LEFT OUTER JOIN (
SELECT
num AS id
,'Fizz' AS value
FROM [range] WHERE num % 3 = 0
) AS fizz
ON fizz.id = [range].num
LEFT OUTER JOIN (
SELECT
num AS id
,'Buzz' AS value
FROM [range] WHERE num % 5 = 0
) AS buzz
ON buzz.id = [range].num
;

We create subselects which give us the Fizz and Buzz filters we are looking for. We then use the CONCAT to join a Fizz to a Buzz which will give us FizzBuzz on numbers divisible by both 3 and 5. If we do not have a Fizz or a Buzz the CONCAT will give us '' (empty string), which means we'll want to return the number as a string.

This gives us:

http://sqlfiddle.com/#!6/9cd0c/20

We can make this a bit more readable using two more CTEs.

DECLARE
@min AS INT = 1
,@max AS INT = 20

;WITH [range] AS (
SELECT @min AS num
UNION ALL
SELECT num + 1 FROM [range] WHERE num < @max
)
, fizz AS (
SELECT
num AS id
,'Fizz' AS value
FROM [range] WHERE num % 3 = 0
)
, buzz AS (
SELECT
num AS id
,'Buzz' AS value
FROM [range] WHERE num % 5 = 0
)
SELECT
CASE
WHEN CONCAT(fizz.value, buzz.value) <> ''
THEN CONCAT(fizz.value, buzz.value)
ELSE CAST(num AS VARCHAR)
END AS value
FROM [range]
LEFT OUTER JOIN fizz
ON fizz.id = [range].num
LEFT OUTER JOIN buzz
ON buzz.id = [range].num
;

Which still gives us the same output.

http://sqlfiddle.com/#!6/9cd0c/22

Personally I find the three CTEs a bit cleaner to read and understand, but I understand that not everyone will feel the same way.

There you have it FizzBuzz using SQL Server. You can find the solution on SQL Fiddle (and see all my typing errors to get to it).

"They say there's no such place... as Paradise. Even if you search to the ends of the Earth, there's nothing there. No matter how far you walk, it's always the same road. It just goes on and on. But, in spite of that... Why am I so driven to find it? A voice calls to me... It says, 'Search for Paradise.'"
-- Keiko Nobumoto, Wolf's Rain
Final Lines

Saturday, November 16, 2013

Working with Multiple Values in a Column with SQL Server

"Any hard lesson that may do thee good."
-- Shakespeare, Much Ado About Nothing
Act I, Scene I, Line 272

SQL is a good tool when you have large sets of relational records--whether or not these kinds of relationships exist in the real world is not the aim of this post. However, SQL is not very good at normalizing multiple records in a single attribute.

I know, an example would be good right about now.

As we see above Attribute 2 has three different values in it (Value 1, Value 2, and Value 3).

Say we want to do the following.

We want to take the Order File's Order Records and place the Order ID and others into the Order Table while placing the Order ID and Flags into the Order Flag Table.

Well we have a problem, how are we going to take the Flags and normalize them into the Order Flag Table?

Say the Flags are separated by spaces so Flags looks like this in an Order Record:

How would we do this in SQL? Hmmm, well SQL is not really good at this kind of process so we need to do a kind of hack.

In SQL Server 2012 (should work in 2008 too) we can transform the Flags into XML that looks like this:

CHS
XMS
Once the data looks like this we can SQL Server's built in XML processing to create a virtual table to obtain the values from.

We can set up a test case using the following a CTE.

;WITH cte(value, id) AS (

SELECT *

FROM (

VALUES

('HERP DREP CHS XMS', 1)

,('HELLO WORLD', 2)

) AS x(a, b)

)

SELECT *

FROM cte

Which looks like this in SQL Fiddle:

http://sqlfiddle.com/#!6/aed15/1

We see that we have two records: ('HERP DREP CHS XMS', 1) and ('HELLO WORLD', 2).

Now lets take the Flags and transform them using an XML trick.

;WITH cte(value, id) AS (
SELECT *
FROM (
VALUES
('HERP DREP CHS XMS', 1)
,('HELLO WORLD', 2)
) AS x(a, b)
)
SELECT
flags.flag.value('.', 'varchar(50)') AS Flag
,id AS "Order ID"
,ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS "KEY"
FROM (
SELECT
CAST('<flag>' + REPLACE(value, ' ', '</flag><flag>') + '</flag>' AS XML)
,id
FROM cte
) AS f(x, id)
CROSS APPLY f.x.nodes('flag') AS flags(flag)

We now have this in SQL Fiddle.

http://sqlfiddle.com/#!6/aed15/2

This gives us each Flag with its Order ID (Herp with 1, Derp with 1, CHS with 1, XMS with 1, Hello with 2, and World with 2).

This is almost what we want but we need an unique Key for each Flag across all of the records. Now in the ETL tool I was using when I faced this problem this week, it needed an unique Key for each of the Flag records that would come out of the SQL used, even though it would be mapping this unique Key to an Identity column in SQL Server. Yes, this is very dumb but

When tools do dumb things, one must make sure that one does not do dumb things himself.
-- Me, this blog post

With this in mind, let us create an unique Key in a smart way for each Flag record coming out of our SQL hack.

This gives us the following in SQL Fiddle.

http://sqlfiddle.com/#!6/aed15/3

Now we have an unique Key!

This gives us each Flag with its Order ID and Key (Herp with 1 and 1, Derp with 1 and 2, CHS with 1 and 3, XMS with 1 and 4, Hello with 2 and 5, and World with 2 and 6).

Nice now we can get back to delivering real value to our customers and not just figuring out tools and strange data.

If you want to see other examples using CTEs and SELECT NULL see these other post by me.
http://comp-phil.blogspot.com/2013/02/the-case-of-not-having-unique-identity.html
http://comp-phil.blogspot.com/2013/02/how-to-remove-duplicate-records-using.html

Monday, February 18, 2013

How to Remove Duplicate Records using Windowing Functions

"Making duplicate copies and computer printouts of things no one wanted even one of in the first place is giving America a new sense of purpose. "
-- Andy Rooney

"Or the Department of Education and another ministry were worried about duplication of effort, so what did they do? They set up two committees to look into duplication and neither knew what the other was up to. It really is a world beyond parody."
-- Rory Bremner

It happens every now and then, you create a large data set for some type of testing and since you are trying to "go faster" you do not bother with automating the process and some how one of the data set insertions gets ran twice. Now you have duplicate data in your data set and need to get rid of it.

Say you have the following table.

CREATE TABLE dbo.PurchaseOrderDetail
(
PurchaseOrderID int NOT NULL,
LineNumber smallint NOT NULL,
ProductID int NULL,
UnitPrice money NULL,
OrderQty smallint NULL,
rowguid uniqueidentifier ROWGUIDCOL NOT NULL
CONSTRAINT DF_PurchaseOrderDetail_rowguid DEFAULT (newid())
);

And the following INSERT statement is ran.

INSERT INTO dbo.PurchaseOrderDetail
(
PurchaseOrderID,
LineNumber,
ProductID,
UnitPrice,
OrderQty
)
VALUES
(100, 1, 1, 12.99, 1)
,(100, 2, 3, 15.99, 1)
,(100, 3, 6, 10.99, 2)
,(101, 2, 1, 12.99, 2)
,(101, 3, 2, 11.99, 1)
,(101, 4, 6, 10.99, 5)
,(102, 1, 1, 12.99, 8)
,(103, 1, 4, 19.99, 1);

Giving you the following data set.

SELECT * FROM dbo.PurchaseOrderDetail
;

PURCHASEORDERID	LINENUMBER	PRODUCTID	UNITPRICE	ORDERQTY	ROWGUID
100	1	1	12.99	1	A89225C8-3267-491C-B4E1-8688B902BDD3
100	2	3	15.99	1	528F7FC0-5870-47C7-ACD3-3955424DD742
100	3	6	10.99	2	498A1A36-8698-4374-9861-49DCF6514CF5
101	2	1	12.99	2	59820CAF-9209-4EAD-BDF0-D36170084E3C
101	3	2	11.99	1	A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30
101	4	6	10.99	5	B4021C5F-0FB3-4330-B1FC-AD7C3DE6A936
102	1	1	12.99	8	A24CFF12-F402-488C-9B26-8D2AF0710F36
103	1	4	19.99	1	3044E8AC-027A-4B13-9791-40768F2A4F3E

Now say some bozo runs the insert statement again and you now have the following data set.

PURCHASEORDERID	LINENUMBER	PRODUCTID	UNITPRICE	ORDERQTY	ROWGUID
100	1	1	12.99	1	A89225C8-3267-491C-B4E1-8688B902BDD3
100	2	3	15.99	1	528F7FC0-5870-47C7-ACD3-3955424DD742
100	3	6	10.99	2	498A1A36-8698-4374-9861-49DCF6514CF5
101	2	1	12.99	2	59820CAF-9209-4EAD-BDF0-D36170084E3C
101	3	2	11.99	1	A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30
101	4	6	10.99	5	B4021C5F-0FB3-4330-B1FC-AD7C3DE6A936
102	1	1	12.99	8	A24CFF12-F402-488C-9B26-8D2AF0710F36
103	1	4	19.99	1	3044E8AC-027A-4B13-9791-40768F2A4F3E
100	1	1	12.99	1	6D0193CD-198F-4A39-991C-9C1CBED4B7A5
100	2	3	15.99	1	5A007E16-76B3-425A-963E-CD1EB3A8A3D4
100	3	6	10.99	2	C01CCEBC-7FDB-4166-8780-6172015949AD
101	2	1	12.99	2	24816420-629B-42D4-94EE-D05D922EF553
101	3	2	11.99	1	8605E6A8-9CF4-4575-B8F4-F68E91D793C5
101	4	6	10.99	5	BD54EB82-4A20-484D-A5CF-53DCB9A27CB2
102	1	1	12.99	8	03DE0A36-6295-46E6-8F4E-C67E36EC2FAE
103	1	4	19.99	1	AE017968-1480-49E6-A9B7-510251C8E33A

The duplicate records have been highlighted. Since the GUID is created a new every time it is ran, the GUID values are different in the duplicate records, but the all the other data is exactly the same.

Say that it takes a rather long time to create the and load the test data, so a TRUNCATE or DROP and recreate are out of the question. What can we do to remove the duplicate data leaving the original data intact?

What we need is to be able to assign an identifier to each row that would be the antonym for the duplicate rows. ROW_NUMBER with a PARTITION BY natural keys (in this example PurchaseOrderID and LineNumber) would server just this case (since we do not care about the ORDER we can use the SELECT NULL trick). One of the records would have the row number of 1 and the duplicate would have the row number of 2, then we could simply DELETE all the records with the row number of 2.

For identifying the records we can create the following common table expression.

WITH cte AS (
SELECT
PurchaseOrderID
,LineNumber
,ROW_NUMBER()
OVER (PARTITION BY PurchaseOrderID, LineNumber ORDER BY (SELECT NULL)) AS num
FROM dbo.PurchaseOrderDetail
)

Which would give use the following (using SELECT * FROM cte)

PURCHASEORDERID	LINENUMBER	NUM
100	1	1
100	1	2
100	2	1
100	2	2
100	3	1
100	3	2
101	2	1
101	2	2
101	3	1
101	3	2
101	4	1
101	4	2
102	1	1
102	1	2
103	1	1
103	1	2

Now we have the duplicate records tagged with a row number of 2, so we can just DELETE those records.

WITH cte AS (
SELECT
PurchaseOrderID
,LineNumber
,ROW_NUMBER()
OVER (PARTITION BY PurchaseOrderID, LineNumber
ORDER BY (SELECT NULL)) AS num
FROM dbo.PurchaseOrderDetail
)
DELETE cte
WHERE num > 1
;

Bingo! We now have a clean data set once again.

SELECT * FROM dbo.PurchaseOrderDetail
;

PURCHASEORDERID	LINENUMBER	PRODUCTID	UNITPRICE	ORDERQTY	ROWGUID
100	1	1	12.99	1	A89225C8-3267-491C-B4E1-8688B902BDD3
100	3	6	10.99	2	498A1A36-8698-4374-9861-49DCF6514CF5
101	3	2	11.99	1	A2973E38-B0CF-4D98-9ADC-5E26BCDDEE30
102	1	1	12.99	8	A24CFF12-F402-488C-9B26-8D2AF0710F36
100	2	3	15.99	1	5A007E16-76B3-425A-963E-CD1EB3A8A3D4
101	2	1	12.99	2	24816420-629B-42D4-94EE-D05D922EF553
101	4	6	10.99	5	BD54EB82-4A20-484D-A5CF-53DCB9A27CB2
103	1	4	19.99	1	AE017968-1480-49E6-A9B7-510251C8E33A

To design and test out this example I used SQL Fiddle using SQL Server 2012.

You can check out all the code at the following URL: http://sqlfiddle.com/#!6/b2f21/8/5

Philosophy of Programming

About Me