PostgreSQL and Sequential Data

Question

I have a dataset that contains:

Table { date itemName }

The date for the most part is sequential. There are no duplicates of the date [as it is the primary key].

The question is split up into multiple parts (all with respect to using SQL):

Is it possible to find gaps in the date series listed in the table? For example: Dates 1/2/09-1/3/09 are missing
Is it possible to find sections of dates that are missing from the table, that has a range greater than n (this is a number determined at run time)? For example: For n = 2 Dates 1/2/09-1/3/09 are not returned but Dates 5/6/09-6/1/09 are.

My approach would be to post-process the results .. jeremy.zawodny.com/blog/archives/010523.html .. but if it's possible inside the query, and doesn't hammer the system too badly, that'd be great :) — warren
– warren, Commented Oct 5, 2009 at 1:56
This isn't a live query that'll be used often, its just for maintenance every-once and a while. — monksy
– monksy, Commented Oct 5, 2009 at 2:06

Ants Aasma · Accepted Answer · 2009-10-05 17:14:12Z

10

If you can use PostgreSQL 8.4 then window functions will help:

SELECT *
    FROM (SELECT itemName, date, date - lag(date) OVER w AS gap
              FROM someTable WINDOW w AS (ORDER BY date)
         ) AS pairs
    WHERE pairs.gap > '1 day'::interval;

answered Oct 5, 2009 at 17:14

Ants Aasma

55.3k16 gold badges98 silver badges99 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Greg Smith Over a year ago

This is exactly the sort of problems window functions intend to solve efficiently, and there's no other SQL-only solution that can run as quickly as this.

monksy Over a year ago

Unfortunately, I don't have version 8.4. I have 8.1. I wish I could get credit to this answer. I really like it.

Tometzky · Accepted Answer · 2009-10-05 08:19:51Z

1

Just create a function in plsql or in a client which will be checking all dates. Like this pseudocode:

date checked_date = 2000-01-01;
int unchecked_section = 0;
while ( checked_date <= today() ) {
  if (! sql(select itemName from Table where itemName=checked_date)) {
    unchecked_section++;
  } else {
    if ( unchecked_section>=n ) {
      print checked_date-unchecked_section, checked_date
    }
    unchecked_section = 0;
  }
  checked_date++;
}
if ( unchecked_section ) {
  print checked_date-unchecked_section, checked_date
}

It does not have to be very fast as it is maintenance only. There aren't many dates to check - only 365 a year.

answered Oct 5, 2009 at 8:19

Tometzky

24.2k5 gold badges64 silver badges79 bronze badges

1 Comment

Greg Smith Over a year ago

If you don't have SQL window functions available, this approach is actually the fastest one possible on a large data set, as it only makes one pass over the table. One thing you need to watch is that the SELECT gets ORDER BY so the rows appear in sorted order. And you should use "SELECT min(date),max(date) from table" to get the loop limits--presuming things end at 'today' is not the best idea. PostgreSQL has lots of programming languages available you can run inside the database, PL/pgSQL is the standard one.

Frank Bollack · Accepted Answer · 2009-10-05 09:08:21Z

1

After some testing I came up with the following SQL statement:

SELECT date, itemName
  FROM "Table" as t1
  WHERE NOT EXISTS (
     SELECT date 
     FROM "Table" as t2 
     WHERE t2.date = (t1.date - INTERVAL '1 day')
  )
  ORDER BY date
  OFFSET 1  -- this will skip the first element

This will get you all rows that have no direct successor.

If you modify the statement to:

SELECT date, itemName
  FROM "Table" as t1
  WHERE NOT EXISTS (
    SELECT date 
    FROM "Table" as t2 
    WHERE (t2.date >= (t1.date - INTERVAL '2 day'))
    AND (t2.date < t1.date)
  )
  ORDER BY date
  OFFSET 1

you can use the INTERVAL length in the subselect's WHERE clause to filter by gaps of at least that size.

Hope that helps.

edited Oct 5, 2009 at 9:08

answered Oct 5, 2009 at 8:40

Frank Bollack

25.3k5 gold badges51 silver badges58 bronze badges

2 Comments

Greg Smith Over a year ago

The runtime of this is proportional to the square of the table size, because both the outside SELECT and the inside EXISTS subquery are both doing things whose runtime is proportional to the size of the table. That might seem reasonable at first, but eventually it's going to get really expensive. Unfortunately, any other solution you do in straight SQL is going to suffer from the same issue, because SQL has no row memory. For an n-row table, you have to do a n X n join some way to solve this type of problem. When available, window functions are the best around this.

Frank Bollack Over a year ago

@Greg: Thanks for the analysis. you are right, this isn't the fastest solution, if windows functios are at hand. But PostgreSQL 8.4 is a quite fresh release so there are chances, the OP is using an older version. Also look at the OPs comment on his question for his runtime performance requirements.

Collectives™ on Stack Overflow

PostgreSQL and Sequential Data

3 Answers 3

2 Comments

1 Comment

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

1 Comment

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related