5

I have a dataset that contains:

Table { date itemName }

The date for the most part is sequential. There are no duplicates of the date [as it is the primary key].

The question is split up into multiple parts (all with respect to using SQL):

  1. Is it possible to find gaps in the date series listed in the table? For example: Dates 1/2/09-1/3/09 are missing
  2. Is it possible to find sections of dates that are missing from the table, that has a range greater than n (this is a number determined at run time)? For example: For n = 2 Dates 1/2/09-1/3/09 are not returned but Dates 5/6/09-6/1/09 are.
2
  • My approach would be to post-process the results .. jeremy.zawodny.com/blog/archives/010523.html .. but if it's possible inside the query, and doesn't hammer the system too badly, that'd be great :) Commented Oct 5, 2009 at 1:56
  • This isn't a live query that'll be used often, its just for maintenance every-once and a while. Commented Oct 5, 2009 at 2:06

3 Answers 3

10

If you can use PostgreSQL 8.4 then window functions will help:

SELECT *
    FROM (SELECT itemName, date, date - lag(date) OVER w AS gap
              FROM someTable WINDOW w AS (ORDER BY date)
         ) AS pairs
    WHERE pairs.gap > '1 day'::interval;
Sign up to request clarification or add additional context in comments.

2 Comments

This is exactly the sort of problems window functions intend to solve efficiently, and there's no other SQL-only solution that can run as quickly as this.
Unfortunately, I don't have version 8.4. I have 8.1. I wish I could get credit to this answer. I really like it.
1

Just create a function in plsql or in a client which will be checking all dates. Like this pseudocode:

date checked_date = 2000-01-01;
int unchecked_section = 0;
while ( checked_date <= today() ) {
  if (! sql(select itemName from Table where itemName=checked_date)) {
    unchecked_section++;
  } else {
    if ( unchecked_section>=n ) {
      print checked_date-unchecked_section, checked_date
    }
    unchecked_section = 0;
  }
  checked_date++;
}
if ( unchecked_section ) {
  print checked_date-unchecked_section, checked_date
}

It does not have to be very fast as it is maintenance only. There aren't many dates to check - only 365 a year.

1 Comment

If you don't have SQL window functions available, this approach is actually the fastest one possible on a large data set, as it only makes one pass over the table. One thing you need to watch is that the SELECT gets ORDER BY so the rows appear in sorted order. And you should use "SELECT min(date),max(date) from table" to get the loop limits--presuming things end at 'today' is not the best idea. PostgreSQL has lots of programming languages available you can run inside the database, PL/pgSQL is the standard one.
1

After some testing I came up with the following SQL statement:

SELECT date, itemName
  FROM "Table" as t1
  WHERE NOT EXISTS (
     SELECT date 
     FROM "Table" as t2 
     WHERE t2.date = (t1.date - INTERVAL '1 day')
  )
  ORDER BY date
  OFFSET 1  -- this will skip the first element

This will get you all rows that have no direct successor.

If you modify the statement to:

SELECT date, itemName
  FROM "Table" as t1
  WHERE NOT EXISTS (
    SELECT date 
    FROM "Table" as t2 
    WHERE (t2.date >= (t1.date - INTERVAL '2 day'))
    AND (t2.date < t1.date)
  )
  ORDER BY date
  OFFSET 1

you can use the INTERVAL length in the subselect's WHERE clause to filter by gaps of at least that size.

Hope that helps.

2 Comments

The runtime of this is proportional to the square of the table size, because both the outside SELECT and the inside EXISTS subquery are both doing things whose runtime is proportional to the size of the table. That might seem reasonable at first, but eventually it's going to get really expensive. Unfortunately, any other solution you do in straight SQL is going to suffer from the same issue, because SQL has no row memory. For an n-row table, you have to do a n X n join some way to solve this type of problem. When available, window functions are the best around this.
@Greg: Thanks for the analysis. you are right, this isn't the fastest solution, if windows functios are at hand. But PostgreSQL 8.4 is a quite fresh release so there are chances, the OP is using an older version. Also look at the OPs comment on his question for his runtime performance requirements.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.