PostgreSQL slow order

Question

I have table (over 100 millions records) on PostgreSQL 13.1

CREATE TABLE report
(
    id     serial primary key,
    license_plate_id integer,
    datetime timestamp
);

Indexes (for test I create both of them):

create index report_lp_datetime_index on report (license_plate_id, datetime);
create index report_lp_datetime_desc_index on report (license_plate_id desc, datetime desc);

So, my question is why query like

select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75)
order by datetime desc
limit 100

Is very slow (~10sec). But query without order statement is fast (milliseconds).

Explain:

explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34, 75,374,57123)
limit 100
Limit  (cost=0.57..400.38 rows=100 width=316) (actual time=0.037..0.216 rows=100 loops=1)
  Buffers: shared hit=103
  ->  Index Scan using report_lp_id_idx on report r  (cost=0.57..44986.97 rows=11252 width=316) (actual time=0.035..0.202 rows=100 loops=1)
        Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
        Buffers: shared hit=103
Planning Time: 0.228 ms
Execution Time: 0.251 ms


explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75,374,57123)
order by datetime desc
limit 100
Limit  (cost=44193.63..44193.88 rows=100 width=316) (actual time=4921.030..4921.047 rows=100 loops=1)
  Buffers: shared hit=11455 read=671
  ->  Sort  (cost=44193.63..44221.76 rows=11252 width=316) (actual time=4921.028..4921.035 rows=100 loops=1)
        Sort Key: datetime DESC
        Sort Method: top-N heapsort  Memory: 128kB
        Buffers: shared hit=11455 read=671
        ->  Bitmap Heap Scan on report r  (cost=151.18..43763.59 rows=11252 width=316) (actual time=54.422..4911.927 rows=12148 loops=1)
              Recheck Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
              Heap Blocks: exact=12063
              Buffers: shared hit=11455 read=671
              ->  Bitmap Index Scan on report_lp_id_idx  (cost=0.00..148.37 rows=11252 width=0) (actual time=52.631..52.632 rows=12148 loops=1)
                    Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
                    Buffers: shared hit=59 read=4
Planning Time: 0.427 ms
Execution Time: 4921.128 ms

Laurenz Albe · Accepted Answer · 2021-02-01 04:40:11Z

1

You seem to have rather slow storage, if reading 671 8kB-blocks from disk takes a couple of seconds.

The way to speed this up is to reorder the table in the same way as the index, so that you can find the required rows in the same or adjacent table blocks:

CLUSTER report_lp_id_idx USING report_lp_id_idx;

Be warned that rewriting the table in this way causes downtime – the table will not be available while it is being rewritten. Moreover, PostgreSQL does not maintain the table order, so subsequent data modifications will cause performance to gradually deteriorate, so that after a while you will have to run CLUSTER again.

But if you need this query to be fast no matter what, CLUSTER is the way to go.

answered Feb 1, 2021 at 4:40

Laurenz Albe

257k22 gold badges312 silver badges388 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

bobflux · Accepted Answer · 2021-02-01 12:17:37Z

0

Your two indices do exactly the same thing, so you can remove the second one, it's useless.

To optimize your query, the order of the fields inside the index must be reversed:

create index report_lp_datetime_index on report (datetime,license_plate_id);


BEGIN;
CREATE TABLE foo (d INTEGER, i INTEGER);
INSERT INTO foo SELECT random()*100000, random()*1000 FROM generate_series(1,1000000) s;
CREATE INDEX foo_d_i ON foo(d DESC,i);
COMMIT;
VACUUM ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100;

 Limit  (cost=0.42..343.92 rows=100 width=8) (actual time=0.076..9.359 rows=100 loops=1)
   ->  Index Only Scan Backward using foo_d_i on foo  (cost=0.42..40976.43 rows=11929 width=8) (actual time=0.075..9.339 rows=100 loops=1)
         Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
         Rows Removed by Filter: 9016
         Heap Fetches: 0
 Planning Time: 0.339 ms
 Execution Time: 9.387 ms

Note the index is not used to optimize the WHERE clause. It is used here as a compact and fast way to store references to the rows ordered by date DESC, so the ORDER BY can do an index-only scan and avoid sorting. By adding column id to the index, an index-only scan can be performed to test the condition on id, without hitting the table for every row. Since there is a low LIMIT value it does not need to scan the whole index, it only scans it in date DESC order until it finds enough rows satisfying the WHERE condition to return the result.

It will be faster if you create the index in date DESC order, this could be useful if you use ORDER BY date DESC + LIMIT in other queries too.

You forget that OP's table has a third column, and he is using SELECT *. So that wouldn't be an index-only scan.

Easy to work around. The optimum way to do this query would be an index-only scan to filter on WHERE conditions, then LIMIT, then hit the table to get the rows. For some reason if "select *" is used postgres takes the id column from the table instead of taking it from the index, which results in lots of unnecessary heap fetches for rows whose id is rejected by the WHERE condition.

Easy to work around, by doing it manually. I've also added another bogus column to make sure the SELECT * hits the table.

EXPLAIN (ANALYZE,buffers) SELECT * FROM foo 
JOIN (SELECT d,i FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (d,i) 
ORDER BY d DESC LIMIT 100;

 Limit  (cost=0.85..1281.94 rows=1 width=17) (actual time=0.052..3.618 rows=100 loops=1)
   Buffers: shared hit=453
   ->  Nested Loop  (cost=0.85..1281.94 rows=1 width=17) (actual time=0.050..3.594 rows=100 loops=1)
         Buffers: shared hit=453
         ->  Limit  (cost=0.42..435.44 rows=100 width=8) (actual time=0.037..2.953 rows=100 loops=1)
               Buffers: shared hit=53
               ->  Index Only Scan using foo_d_i on foo foo_1  (cost=0.42..51936.43 rows=11939 width=8) (actual time=0.037..2.935 rows=100 loops=1)
                     Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
                     Rows Removed by Filter: 9010
                     Heap Fetches: 0
                     Buffers: shared hit=53
         ->  Index Scan using foo_d_i on foo  (cost=0.42..8.45 rows=1 width=17) (actual time=0.005..0.005 rows=1 loops=100)
               Index Cond: ((d = foo_1.d) AND (i = foo_1.i))
               Buffers: shared hit=400
 Execution Time: 3.663 ms

Another option is to just add the primary key to the date,license_plate index.

SELECT * FROM foo JOIN (SELECT id FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (id) ORDER BY d DESC LIMIT 100;

 Limit  (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.920..3.947 rows=100 loops=1)
   Buffers: shared hit=473
   ->  Sort  (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.919..3.931 rows=100 loops=1)
         Sort Key: foo.d DESC
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=473
         ->  Nested Loop  (cost=0.85..1354.66 rows=100 width=17) (actual time=0.055..3.858 rows=100 loops=1)
               Buffers: shared hit=473
               ->  Limit  (cost=0.42..509.41 rows=100 width=8) (actual time=0.039..3.116 rows=100 loops=1)
                     Buffers: shared hit=73
                     ->  Index Only Scan using foo_d_i_id on foo foo_1  (cost=0.42..60768.43 rows=11939 width=8) (actual time=0.039..3.093 rows=100 loops=1)
                           Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
                           Rows Removed by Filter: 9010
                           Heap Fetches: 0
                           Buffers: shared hit=73
               ->  Index Scan using foo_pkey on foo  (cost=0.42..8.44 rows=1 width=17) (actual time=0.006..0.006 rows=1 loops=100)
                     Index Cond: (id = foo_1.id)
                     Buffers: shared hit=400
Execution Time: 3.972 ms

Edit

After thinking about it... since the LIMIT restricts the output to 100 rows ordered by date desc, wouldn't it be nice if we could get the 100 most recent rows for each license_plate_id, put all that into a top-n sort, and only keep the best 100 for all license_plate_ids? That would avoid reading and throwing away a lot of rows from the index. Even if that's much faster than hitting the table, it will still load up these index pages in RAM and clog up your buffers with stuff you don't actually need to keep in cache. Let's use LATERAL JOIN:

EXPLAIN (ANALYZE,BUFFERS) 
SELECT * FROM foo 
  JOIN (SELECT d,i FROM 
    (VALUES (1),(2),(4),(5),(6),(7),(8),(10),(15),(22),(34),(75)) idlist 
    CROSS JOIN LATERAL 
    (SELECT d,i FROM foo WHERE i=idlist.column1 ORDER BY d DESC LIMIT 100) f2 
    ORDER BY d DESC LIMIT 100
  ) f3 USING (d,i)
  ORDER BY d DESC LIMIT 100;

It's even faster: 2ms, and it uses the index on (license_plate_id,date) instead of the other way around. Also, and this is important, since each subquery in the lateral hits only the index pages that contain rows that will actually be selected, while the previous queries hit much more index pages. So you save on RAM buffers.

If you don't need the index on (date,license_plate_id) and don't want to keep a useless index, that could be interesting since this query doesn't use it. On the other hand, if you need the index on (date,license_plate_id) for something else and want to keep it, then... maybe not.

Please post results for the winning query 🔥

edited Feb 1, 2021 at 12:17

answered Jan 31, 2021 at 14:16

bobflux

11.6k3 gold badges31 silver badges33 bronze badges

15 Comments

Laurenz Albe Over a year ago

I take back the "nonsense", but your index doesn't support the WHERE condition (see the Filter). If you remove the useless license_plate_id from the index definition, I'll upvote.

bobflux Over a year ago

No. Removing license_plate_id from the index would prevent postgres from doing an index-only scan which is fast because it is mostly sequential IO. It would have to hit the table to fetch license_plate_id for each row, which would result in lots of random IO, which is OP's problem. It's pretty much the same idea vs using CLUSTER, except the index is much smaller than the table and always ordered, so it's better, but limited to columns actually stored in the index of course.

Laurenz Albe Over a year ago

You forget that OP's table has a third column, and he is using SELECT *. So that wouldn't be an index-only scan.

bobflux Over a year ago

see edit -- what matters is to only hit the table for rows that will satisfy the WHERE condition, and avoid table IO for rows that don't. I expected postgres to do the right thing, ie use the id column from the index in the WHERE, but it did not. Easy to work around.

bobflux Over a year ago

I've just found a much better solution using LATERAL JOIN

|

Collectives™ on Stack Overflow

PostgreSQL slow order

2 Answers 2

Comments

15 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

15 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related