4

I have a large table that I run queries like select date_att > date '2001-01-01' on regularly. I'm trying to increase the speed of these queries by clustering the table on date_att, but when I run those queries through explain analyze it still chooses to sequentially scan the table, even on queries as simple as SELECT date_att from table where date_att > date '2001-01-01'. Why is this the case? I understand that since the query returns a large portion of the table, the optimizer will ignore the index, but since the table is clustered by that attribute, shouldn't it be able to really quickly binary search through the table to the point where date > '2001-01-01' and return all results after that? This query still takes as much time as without the clustering.

8
  • 1
    Can you show your table and index definition, I am confused by the fact you say you have a clustered index on the date column, then later you refer to "sequentially scan the table", a table scan is a clustered index scan. Commented Dec 5, 2013 at 21:12
  • 1
    Thanks for the quick response- The table definition is: CREATE TABLE test(dummy_primary_key integer, date_att date, primary key(dummy_primary_key)); Then I create an index on date_att and cluster on it with the following commands: CREATE INDEX dateindex on test(date_att); CLUSTER test using dateindex; Then, running SELECT date_att FROM test WHERE date_att > date '2001-01-01' still takes around 10 seconds. Test holds around a gig of data and the query returns about half of that. From my understanding of clustered indexes, this should be very quick but its still quite slow. Commented Dec 5, 2013 at 21:14
  • Hmmm, afraid this is a mystery to me. I don't know enough about Postgresql to know how the analyser works, what is interesting is that with the table definition you have posted the index is used, as soon as other columns are added to the table the index is no longer used, even though the other columns are not referenced in the query. sqlfiddle.com/#!15/94380/1 Commented Dec 5, 2013 at 21:22
  • Strange... I just tried the same on psql command line and it seqscans no matter what the definition is... Thanks anyways! Commented Dec 5, 2013 at 21:32
  • 2
    Please post the the execution plan from explain analyze (or upload it to explain.depesz.com). Do not post this as a comment, edit your question. You might also want to read this: wiki.postgresql.org/wiki/Slow_Query_Questions Commented Dec 5, 2013 at 22:10

1 Answer 1

11

It seems like you are confusing two concepts:

PostgreSQL clustering of a table

Clustering a table according to an index in PostgreSQL aligns the order of table rows (stored in a heap table) to the order in the index at the time of clustering. From the docs:

Clustering is a one-time operation: when the table is subsequently updated, the changes are not clustered. http://www.postgresql.org/docs/9.3/static/sql-cluster.html

Clustering potentially (often) improves query speed for range queries because the selected rows are stored nearby in the heap table by coincidence. There is nothing that guarantees this order! Consequently the optimizer cannot assume that it is true.

E.g. if you insert a new row that fulfills your where clause it might be inserted at any place in the table — e.g. where rows for 1990 are stored. Hence, this assumtion doesn't hold true:

but since the table is clustered by that attribute, shouldn't it be able to really quickly binary > search through the table to the point where date > '2001-01-01' and return all results after that?

This brings us to the other concept you mentioned:

Clustered Indexes

This is something completely different, not supported by PostgreSQL at all but by many other databases (SQL Server, MySQL with InnoDB and also Oracle where it is called 'Index Organized Table').

In that case, the table data itself is stored in an index structure — there is no separate heap structure! As it is an index, the order is also maintained for each insert/update/delete. Hence your assumption would hold true and indeed I'd expect the above mentioned database to behave as you would expect it (given the date column is the clustering key!).

Hope that clarifies it.

Sign up to request clarification or add additional context in comments.

3 Comments

@josh-hull I've added your suggested edits manually as is was already rejected by the crowd. Stupid System. See also here: meta.stackexchange.com/questions/185472/…
So what are the use-cases for clustering a table in PostgreSQL? I am reading on this topic since last 12 hours and I am utterly confused now.
@KamilLatosinski I've never used PostgreSQL clustering for anything. It's an old feature and I doubt it works well with more modern features (like Index Only Scan). I guess it's use is limited to a very few corner cases nowadays.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.