
EXPLAIN ANALYZE shows that Postgres uses an index scan for my query that fetches rows and filters by date (i.e., 2017-04-14 05:27:51.039):

explain analyze select * from tbl t where updated > '2017-04-14 05:27:51.039';
                                                          QUERY PLAN                                                          
 -----------------------------------------------------------------------------------------------------------------------------
  Index Scan using updated on tbl t  (cost=0.43..7317.12 rows=10418 width=93) (actual time=0.011..0.515 rows=1179 loops=1)
    Index Cond: (updated > '2017-04-14 05:27:51.039'::timestamp without time zone)
  Planning time: 0.102 ms
  Execution time: 0.720 ms

However, running the same query with a different date filter, '2016-04-14 05:27:51.039', shows that Postgres uses a seq scan instead:

explain analyze select * from tbl t where updated > '2016-04-14 05:27:51.039';
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Seq Scan on tbl t  (cost=0.00..176103.94 rows=5936959 width=93) (actual time=0.008..2005.455 rows=5871963 loops=1)
   Filter: (updated > '2016-04-14 05:27:51.039'::timestamp without time zone)
   Rows Removed by Filter: 947
 Planning time: 0.100 ms
 Execution time: 2910.086 ms

How does Postgres decide which to use, specifically when filtering by date?

2 Answers


The Postgres query planner bases its decisions on cost estimates and column statistics, which are gathered by ANALYZE and opportunistically by some other utility commands. That all happens automatically when autovacuum is on (by default).
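If in doubt whether statistics are fresh (say, after a bulk load), you can run ANALYZE by hand and check the timestamps yourself. A minimal sketch, assuming the table from the question:

ANALYZE tbl;

-- when were statistics last gathered, manually and by autovacuum?
SELECT last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'tbl';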

The manual:

Most queries retrieve only a fraction of the rows in a table, due to WHERE clauses that restrict the rows to be examined. The planner thus needs to make an estimate of the selectivity of WHERE clauses, that is, the fraction of rows that match each condition in the WHERE clause. The information used for this task is stored in the pg_statistic system catalog. Entries in pg_statistic are updated by the ANALYZE and VACUUM ANALYZE commands, and are always approximate even when freshly updated.

There is a row count (in pg_class), a list of most common values, etc.
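Both are easy to inspect. A sketch, again assuming the table and column from the question (for a mostly-unique timestamp column, most_common_vals may well be empty):

-- planner's row and page count estimates:
SELECT reltuples, relpages
FROM   pg_class
WHERE  relname = 'tbl';

-- most common values and their frequencies:
SELECT most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  tablename = 'tbl'
AND    attname = 'updated';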

The more rows Postgres expects to find, the more likely it is to switch to a sequential scan, which is cheaper for retrieving large portions of a table.

Generally, the progression is index scan -> bitmap index scan -> sequential scan as the number of expected rows grows.
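You can watch the planner weigh these alternatives by temporarily disabling one in your session; a sketch for experimentation only, not for production use:

SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM tbl t WHERE updated > '2016-04-14 05:27:51.039';
RESET enable_seqscan;

The resulting plan shows what the index scan would cost for the low-selectivity filter, which should come out higher than the sequential scan cost above.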

For your particular example, the important statistic is histogram_bounds, which gives Postgres a rough idea of how many rows have a value greater than the given one. The view pg_stats presents this in a form more convenient for the human eye:

SELECT histogram_bounds
FROM   pg_stats
WHERE  tablename = 'tbl'
AND    attname = 'updated';

There is a dedicated chapter explaining row estimation in the manual.
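As a rough illustration of the arithmetic described there: the histogram buckets are equi-depth, so the fraction of histogram bounds above your constant approximates the selectivity, and multiplying that by reltuples yields the row estimate. A manual sketch (the planner additionally interpolates within the matched bucket; the cast through text is needed because histogram_bounds is of type anyarray):

WITH hist AS (
    SELECT unnest(histogram_bounds::text::timestamp[]) AS bound
    FROM   pg_stats
    WHERE  tablename = 'tbl'
    AND    attname = 'updated'
)
SELECT count(*) FILTER (WHERE bound > '2016-04-14 05:27:51.039')::numeric
       / count(*) AS approx_selectivity
FROM   hist;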


Obviously, query optimization is tricky. This answer is not intended to dive into the specifics of the Postgres optimizer; instead, it aims to give you some background on how the decision to use an index is made.

Your first query is estimated to return 10,418 rows. When using an index, the following operations happen:

  • The engine uses the index to find the first value meeting the condition.
  • The engine then loops over the values, finishing when the condition is no longer true.
  • For each value in the index, the engine then looks up the data on the data page.

In other words, there is a little bit of overhead when using the index -- initializing the index and then looking up each data page individually.
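You can make that per-page work visible with the BUFFERS option, which reports how many buffers each node touched; a sketch using the first query from the question:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tbl t WHERE updated > '2017-04-14 05:27:51.039';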

When the engine does a full table scan it:

  • Starts with the first record on the first page
  • Does the comparison and accepts or rejects the record
  • Continues sequentially through all data pages

There is no additional overhead. Further, the engine can "pre-load" the next pages to be scanned while processing the current page. This overlap of I/O and processing is a big win.

The point I'm trying to make is that getting the balance between these two can be tricky. Somewhere between 10,418 and 5,936,959, Postgres decides that the index overhead (and fetching the pages randomly) costs more than just scanning the whole table.
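That balance is governed by the planner's cost constants. By default, Postgres assumes a random page fetch is four times as expensive as a sequential one; lowering random_page_cost (common on SSDs) shifts the crossover point toward index scans. A session-local sketch, not a recommendation:

SHOW seq_page_cost;     -- 1.0 by default
SHOW random_page_cost;  -- 4.0 by default

SET random_page_cost = 1.1;
EXPLAIN SELECT * FROM tbl t WHERE updated > '2016-04-14 05:27:51.039';
RESET random_page_cost;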
