0

I have table (over 100 millions records) on PostgreSQL 13.1

CREATE TABLE report
(
    id     serial primary key,
    license_plate_id integer,
    datetime timestamp
);

Indexes (for test I create both of them):

create index report_lp_datetime_index on report (license_plate_id, datetime);
create index report_lp_datetime_desc_index on report (license_plate_id desc, datetime desc);

So, my question is why query like

select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75)
order by datetime desc
limit 100

Is very slow (~10sec). But query without order statement is fast (milliseconds).

Explain:

explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34, 75,374,57123)
limit 100
Limit  (cost=0.57..400.38 rows=100 width=316) (actual time=0.037..0.216 rows=100 loops=1)
  Buffers: shared hit=103
  ->  Index Scan using report_lp_id_idx on report r  (cost=0.57..44986.97 rows=11252 width=316) (actual time=0.035..0.202 rows=100 loops=1)
        Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
        Buffers: shared hit=103
Planning Time: 0.228 ms
Execution Time: 0.251 ms


explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75,374,57123)
order by datetime desc
limit 100
Limit  (cost=44193.63..44193.88 rows=100 width=316) (actual time=4921.030..4921.047 rows=100 loops=1)
  Buffers: shared hit=11455 read=671
  ->  Sort  (cost=44193.63..44221.76 rows=11252 width=316) (actual time=4921.028..4921.035 rows=100 loops=1)
        Sort Key: datetime DESC
        Sort Method: top-N heapsort  Memory: 128kB
        Buffers: shared hit=11455 read=671
        ->  Bitmap Heap Scan on report r  (cost=151.18..43763.59 rows=11252 width=316) (actual time=54.422..4911.927 rows=12148 loops=1)
              Recheck Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
              Heap Blocks: exact=12063
              Buffers: shared hit=11455 read=671
              ->  Bitmap Index Scan on report_lp_id_idx  (cost=0.00..148.37 rows=11252 width=0) (actual time=52.631..52.632 rows=12148 loops=1)
                    Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
                    Buffers: shared hit=59 read=4
Planning Time: 0.427 ms
Execution Time: 4921.128 ms

2 Answers 2

1

You seem to have rather slow storage, if reading 671 8kB-blocks from disk takes a couple of seconds.

The way to speed this up is to reorder the table in the same way as the index, so that you can find the required rows in the same or adjacent table blocks:

CLUSTER report_lp_id_idx USING report_lp_id_idx;

Be warned that rewriting the table in this way causes downtime – the table will not be available while it is being rewritten. Moreover, PostgreSQL does not maintain the table order, so subsequent data modifications will cause performance to gradually deteriorate, so that after a while you will have to run CLUSTER again.

But if you need this query to be fast no matter what, CLUSTER is the way to go.

Sign up to request clarification or add additional context in comments.

Comments

0

Your two indices do exactly the same thing, so you can remove the second one, it's useless.

To optimize your query, the order of the fields inside the index must be reversed:

create index report_lp_datetime_index on report (datetime,license_plate_id);


BEGIN;
CREATE TABLE foo (d INTEGER, i INTEGER);
INSERT INTO foo SELECT random()*100000, random()*1000 FROM generate_series(1,1000000) s;
CREATE INDEX foo_d_i ON foo(d DESC,i);
COMMIT;
VACUUM ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100;

 Limit  (cost=0.42..343.92 rows=100 width=8) (actual time=0.076..9.359 rows=100 loops=1)
   ->  Index Only Scan Backward using foo_d_i on foo  (cost=0.42..40976.43 rows=11929 width=8) (actual time=0.075..9.339 rows=100 loops=1)
         Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
         Rows Removed by Filter: 9016
         Heap Fetches: 0
 Planning Time: 0.339 ms
 Execution Time: 9.387 ms

Note the index is not used to optimize the WHERE clause. It is used here as a compact and fast way to store references to the rows ordered by date DESC, so the ORDER BY can do an index-only scan and avoid sorting. By adding column id to the index, an index-only scan can be performed to test the condition on id, without hitting the table for every row. Since there is a low LIMIT value it does not need to scan the whole index, it only scans it in date DESC order until it finds enough rows satisfying the WHERE condition to return the result.

It will be faster if you create the index in date DESC order, this could be useful if you use ORDER BY date DESC + LIMIT in other queries too.

You forget that OP's table has a third column, and he is using SELECT *. So that wouldn't be an index-only scan.

Easy to work around. The optimum way to do this query would be an index-only scan to filter on WHERE conditions, then LIMIT, then hit the table to get the rows. For some reason if "select *" is used postgres takes the id column from the table instead of taking it from the index, which results in lots of unnecessary heap fetches for rows whose id is rejected by the WHERE condition.

Easy to work around, by doing it manually. I've also added another bogus column to make sure the SELECT * hits the table.

EXPLAIN (ANALYZE,buffers) SELECT * FROM foo 
JOIN (SELECT d,i FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (d,i) 
ORDER BY d DESC LIMIT 100;

 Limit  (cost=0.85..1281.94 rows=1 width=17) (actual time=0.052..3.618 rows=100 loops=1)
   Buffers: shared hit=453
   ->  Nested Loop  (cost=0.85..1281.94 rows=1 width=17) (actual time=0.050..3.594 rows=100 loops=1)
         Buffers: shared hit=453
         ->  Limit  (cost=0.42..435.44 rows=100 width=8) (actual time=0.037..2.953 rows=100 loops=1)
               Buffers: shared hit=53
               ->  Index Only Scan using foo_d_i on foo foo_1  (cost=0.42..51936.43 rows=11939 width=8) (actual time=0.037..2.935 rows=100 loops=1)
                     Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
                     Rows Removed by Filter: 9010
                     Heap Fetches: 0
                     Buffers: shared hit=53
         ->  Index Scan using foo_d_i on foo  (cost=0.42..8.45 rows=1 width=17) (actual time=0.005..0.005 rows=1 loops=100)
               Index Cond: ((d = foo_1.d) AND (i = foo_1.i))
               Buffers: shared hit=400
 Execution Time: 3.663 ms

Another option is to just add the primary key to the date,license_plate index.

SELECT * FROM foo JOIN (SELECT id FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (id) ORDER BY d DESC LIMIT 100;

 Limit  (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.920..3.947 rows=100 loops=1)
   Buffers: shared hit=473
   ->  Sort  (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.919..3.931 rows=100 loops=1)
         Sort Key: foo.d DESC
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=473
         ->  Nested Loop  (cost=0.85..1354.66 rows=100 width=17) (actual time=0.055..3.858 rows=100 loops=1)
               Buffers: shared hit=473
               ->  Limit  (cost=0.42..509.41 rows=100 width=8) (actual time=0.039..3.116 rows=100 loops=1)
                     Buffers: shared hit=73
                     ->  Index Only Scan using foo_d_i_id on foo foo_1  (cost=0.42..60768.43 rows=11939 width=8) (actual time=0.039..3.093 rows=100 loops=1)
                           Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
                           Rows Removed by Filter: 9010
                           Heap Fetches: 0
                           Buffers: shared hit=73
               ->  Index Scan using foo_pkey on foo  (cost=0.42..8.44 rows=1 width=17) (actual time=0.006..0.006 rows=1 loops=100)
                     Index Cond: (id = foo_1.id)
                     Buffers: shared hit=400
Execution Time: 3.972 ms

Edit

After thinking about it... since the LIMIT restricts the output to 100 rows ordered by date desc, wouldn't it be nice if we could get the 100 most recent rows for each license_plate_id, put all that into a top-n sort, and only keep the best 100 for all license_plate_ids? That would avoid reading and throwing away a lot of rows from the index. Even if that's much faster than hitting the table, it will still load up these index pages in RAM and clog up your buffers with stuff you don't actually need to keep in cache. Let's use LATERAL JOIN:

EXPLAIN (ANALYZE,BUFFERS) 
SELECT * FROM foo 
  JOIN (SELECT d,i FROM 
    (VALUES (1),(2),(4),(5),(6),(7),(8),(10),(15),(22),(34),(75)) idlist 
    CROSS JOIN LATERAL 
    (SELECT d,i FROM foo WHERE i=idlist.column1 ORDER BY d DESC LIMIT 100) f2 
    ORDER BY d DESC LIMIT 100
  ) f3 USING (d,i)
  ORDER BY d DESC LIMIT 100;

It's even faster: 2ms, and it uses the index on (license_plate_id,date) instead of the other way around. Also, and this is important, since each subquery in the lateral hits only the index pages that contain rows that will actually be selected, while the previous queries hit much more index pages. So you save on RAM buffers.

If you don't need the index on (date,license_plate_id) and don't want to keep a useless index, that could be interesting since this query doesn't use it. On the other hand, if you need the index on (date,license_plate_id) for something else and want to keep it, then... maybe not.

Please post results for the winning query 🔥

15 Comments

I take back the "nonsense", but your index doesn't support the WHERE condition (see the Filter). If you remove the useless license_plate_id from the index definition, I'll upvote.
No. Removing license_plate_id from the index would prevent postgres from doing an index-only scan which is fast because it is mostly sequential IO. It would have to hit the table to fetch license_plate_id for each row, which would result in lots of random IO, which is OP's problem. It's pretty much the same idea vs using CLUSTER, except the index is much smaller than the table and always ordered, so it's better, but limited to columns actually stored in the index of course.
You forget that OP's table has a third column, and he is using SELECT *. So that wouldn't be an index-only scan.
see edit -- what matters is to only hit the table for rows that will satisfy the WHERE condition, and avoid table IO for rows that don't. I expected postgres to do the right thing, ie use the id column from the index in the WHERE, but it did not. Easy to work around.
I've just found a much better solution using LATERAL JOIN
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.