Monday, January 21, 2013

Finding Islands of Data

"But the universe, as a collection of finite things, presents itself as a kind of island situated in a pure vacuity to which time, regarded as a series of mutually exclusive moments, is nothing and does nothing."

When I first started using SQL Server 2008 one of the most interesting things I stumbled across was the OVER clause.  At first when I started using I enjoyed that I could use aggregate functions without the need for a GROUP BY clause, in fact I aggregate over different sets of data in the same SELECT statement!

Having grown more use to the idea of thinking in sets, I now see OVER clause and windowing functions in general are very valuable tools in the SQL tool belt.

Using the following:

CREATE TABLE #t1 (
  col1 int NOT NULL
);
GO

INSERT INTO #t1
  VALUES (2),(3),(4),(9),(11),(12),(13),(14),(22),(23),(45),(46);
GO

We create a table with one column and some integer values.

With just this simple data set we can demonstrate how islands of day can be found.  What is an island of data?  An island of data is simply a gap in the data.  Say you get a data feed of security prices, but some of the data is incomplete for some days (maybe a security is thinly traded or maybe whomever you are getting the feed from has some issues).  These gaps in the data form what is called data islands.

Looking at our simple data we can see a few data islands.

col1
2
3
4
9
11
12
13
14
22
23
45
46

The different data islands have been highlighted in different colors.  It would be nice if the data was a way that the different colors had different group identifiers so that we do a GROUP BY (or something similar) on the identifier.  As Polya shows us in How to Solve It, if we can solve an auxiliary problem perhaps we can solve this one.

If we use the ROW_NUMBER window function and ORDER BY the value in col1 we would get the following.

SELECT col1, ROW_NUMBER() OVER(ORDER BY col1) AS rowid
  FROM #t1;
GO

col1 rowid
2 1
3 2
4 3
9 4
11 5
12 6
13 7
14 8
22 9
23 10
45 11
46 12

All we did was simply return the ROW_NUMBER assigning the number based on the values in col1.  Thus 2 is assigned 1 since it is the lowest value, 9 is assigned 4 since there are three values lower than it, 46 is assigned 12 since all eleven other values are lower than it, and so on.  Each ROW_NUMBER is assigned in relation to all the other values it is based on, in this case the number is assigned based on the value in col1.

If we look closer we will find that the difference between the rowid and col1 holds constant for the whole group meaning that we can simple subtract the rowid and col1 to get a value that we can GROUP BY as an identifier.

SELECT col1, col1 - ROW_NUMBER() OVER(ORDER BY col1) AS groupid
  FROM #t1;
GO

col1 groupid
2 1
3 1
4 1
9 5
11 6
12 6
13 6
14 6
22 13
23 13
45 34
46 34

Now we have our identifier to GROUP BY we can do the following to get the range of each island of data.

SELECT DISTINCT
   MIN(col1) OVER(PARTITION BY groupid) AS start_range
  ,MAX(col1) OVER(PARTITION BY groupid) AS end_range
  FROM (
    SELECT 
       col1
      ,col1 - ROW_NUMBER() OVER(ORDER BY col1) AS groupid
      FROM #t1
  ) AS x;
GO

start_rangeend_range
24
99
1114
2223
4546

The PARTITION BY in the OVER clause is very similar to a GROUP BY, the difference being that the PARTITION BY takes the value in the current row and looks at all the related values in the current window, or simply in the case where col1 is equal to 3 the MIN ans MAX are found for all of the col1's with the related groupid of 1, which would be the value 2 ans 4.  Since the OVER clause is looked at for each row of data we need to use a DISTINCT to remove the duplicates.

There you have it, you can find islands of data by using simple windowing functions.

I've placed this code on SQL Fiddle here.