Sunday, April 14, 2013

Don't Mind the Gap

"That I might sleep out this great gap of time
My Antony is away."
-- Shakespeare's Antony and Cleopatra
Act I, Scene V, Lines 5-6


A very common problem with data sets is missing data.  It is such a common problem that we have well know patterns to identify missing data.

How to Find Data Gaps

A data gap is when data should be found either in some type of order or range or ... but is not.  Such as trade data for a security, the security should have some activity everyday, so if there are dates missing and open, close, high, and low then there is a data gap.



In the data set above, we see that 7, 8, 13, and 14 are missing.  In order to find what is missing we need to sort, remove duplicate records, group the data by a range id, and then find the min and max for the range id.



Given the algorithm above we can solve it with the following steps.

Step 1: sort



[1..5] @ [11..13] @ [33..40]
  |> List.sort

> val it : int list =
  [1; 2; 3; 4; 5; 11; 12; 13; 33; 34; 35; 36; 37; 38; 39; 40]

Step 2: dedup


[1..5] @ [11..13] @ [33..40]
  |> Seq.distinct

> val it : seq = seq [1; 2; 3; 4; ...]

Step 3: assign group ids


[1..5] @ [11..13] @ [33..40]
  |> Seq.mapi(fun i x -> (x, x - i))

> val it : seq = seq [(1, 1); (2, 1); (3, 1); (4, 1); ...]

Step 4: group



seq [(1, 1); (2, 1); (3, 1); (4, 1)]
  |> Seq.groupBy snd

> val it : seq> =
  seq [(1, seq [(1, 1); (2, 1); (3, 1); (4, 1)])]

Step 5: find group's min and max




seq [(1, seq [(1, 1); (2, 1); (3, 1); (4, 1)])]
  |> Seq.map (fun (k, v) -> (fst (Seq.min v), fst (Seq.max v)))

> val it : seq = seq [(1, 4)]

Finding Data Gaps with F# (tryfsharp)

let gapper x =
  x
    |> List.sort
    |> Seq.distinct
    |> Seq.mapi
      (fun i x -> (x, x - i))
    |> Seq.groupBy snd
    |> Seq.map
      (fun (k, v) -> 
        (fst (Seq.min v),
         fst (Seq.max v)))
    |> List.ofSeq

Examples


gapper ([1..5] @ [11..13] @ [33..40] @ [20] @ [2..5] @ [2]);;

> val it : (int * int) list = [(1, 5); (11, 13); (20, 20); (33, 40)]

gapper ([1..5] @ [2..9] @ [2..5] @ [2] @ [10..13] @ [102..110]);;

> val it : (int * int) list = [(1, 13); (102, 110)]

I blogged about this problem before in how to Find Islands of Data using T-SQL after reading about the problem in Itzik Ben-Gan's book High-Performance T-SQL Using Windowing Functions (which is really good).  

Finding Data Gaps with T-SQL (SQL Fiddle)

SELECT DISTINCT
   MIN(col1) OVER(PARTITION BY groupid) AS start_range
  ,MAX(col1) OVER(PARTITION BY groupid) AS end_range
  FROM (
    SELECT 
       col1
      ,col1 - ROW_NUMBER() OVER(ORDER BY col1) AS groupid
      FROM dbo.t1
  ) AS x
;