
There's a table with 15M rows holding users' inbox data:

 user_id         | integer                  | not null
 subject         | character varying(255)   | not null 
...
 last_message_id | integer                  | 
 last_message_at | timestamp with time zone |
 deleted_at      | timestamp with time zone | 

Here's the slow query in a nutshell:

SELECT *
FROM dialogs
WHERE user_id = 1234
AND deleted_at IS NULL
ORDER BY last_message_id DESC
LIMIT 21

Full query (irrelevant fields removed):

SELECT "dialogs"."id", "dialogs"."subject", "dialogs"."product_id", "dialogs"."user_id", "dialogs"."participant_id", "dialogs"."thread_id", "dialogs"."last_message_id", "dialogs"."last_message_at", "dialogs"."read_at", "dialogs"."deleted_at", "products"."id", ... , T4."id", ... , "messages"."id", ...,  
FROM "dialogs" 
LEFT OUTER JOIN "products" ON ("dialogs"."product_id" = "products"."id") 
INNER JOIN "auth_user" T4 ON ("dialogs"."participant_id" = T4."id")
LEFT OUTER JOIN "messages" ON ("dialogs"."last_message_id" = "messages"."id") 
WHERE ("dialogs"."deleted_at" IS NULL AND "dialogs"."user_id" = 9069) 
ORDER BY "dialogs"."last_message_id" DESC
LIMIT 21;

EXPLAIN:

Limit  (cost=1.85..28061.24 rows=21 width=1693) (actual time=4.700..93087.871 rows=17 loops=1)
  ->  Nested Loop Left Join  (cost=1.85..9707215.30 rows=7265 width=1693) (actual time=4.699..93087.861 rows=17 loops=1)
        ->  Nested Loop  (cost=1.41..9647421.07 rows=7265 width=1457) (actual time=4.689..93062.481 rows=17 loops=1)
              ->  Nested Loop Left Join  (cost=0.99..9611285.66 rows=7265 width=1115) (actual time=4.676..93062.292 rows=17 loops=1)
                    ->  Index Scan Backward using dialogs_last_message_id on dialogs  (cost=0.56..9554417.92 rows=7265 width=102) (actual time=4.629..93062.050 rows=17 loops=1)
                          Filter: ((deleted_at IS NULL) AND (user_id = 9069))
                          Rows Removed by Filter: 6852907
                    ->  Index Scan using products_pkey on products  (cost=0.43..7.82 rows=1 width=1013) (actual time=0.012..0.012 rows=1 loops=17)
                          Index Cond: (dialogs.product_id = id)
              ->  Index Scan using auth_user_pkey on auth_user t4  (cost=0.42..4.96 rows=1 width=342) (actual time=0.009..0.010 rows=1 loops=17)
                    Index Cond: (id = dialogs.participant_id)
        ->  Index Scan using messages_pkey on messages  (cost=0.44..8.22 rows=1 width=236) (actual time=1.491..1.492 rows=1 loops=17)
              Index Cond: (dialogs.last_message_id = id)
Total runtime: 93091.494 ms
(14 rows)

  • OFFSET is not used.
  • There's an index on the user_id field.
  • The index on deleted_at isn't used because it's not selective (90% of values are NULL). A partial index (... WHERE deleted_at IS NULL) won't help either.
  • It gets especially slow when the query hits results created a long time ago: the scan then has to filter and discard millions of rows in between.
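
To make the discarded-row cost visible, the plan can be captured with buffer statistics; a minimal check on the simplified query above (assuming PostgreSQL 9.0+ for the BUFFERS option):

```sql
-- Shows "Rows Removed by Filter" and the shared buffers read along the way
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM dialogs
WHERE user_id = 1234
  AND deleted_at IS NULL
ORDER BY last_message_id DESC
LIMIT 21;
```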

List of indexes:

Indexes:
    "dialogs_pkey" PRIMARY KEY, btree (id)
    "dialogs_deleted_at_d57b320e_uniq" btree (deleted_at) WHERE deleted_at IS NULL
    "dialogs_last_message_id" btree (last_message_id)
    "dialogs_participant_id" btree (participant_id)
    "dialogs_product_id" btree (product_id)
    "dialogs_thread_id" btree (thread_id)
    "dialogs_user_id" btree (user_id)

Currently I'm thinking about querying only recent data (i.e. ... WHERE last_message_at > <date 3-6 months ago>) with an appropriate index (BRIN?).
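
That idea could be sketched as follows (index name and cutoff are illustrative; a BRIN index only pays off when last_message_at correlates with the table's physical row order, e.g. rows appended roughly in time order):

```sql
-- Illustrative: tiny index summarizing block ranges of last_message_at
CREATE INDEX dialogs_last_message_at_brin
    ON dialogs USING brin (last_message_at);

-- Restrict the scan to recent data so old blocks are skipped entirely
SELECT *
FROM dialogs
WHERE user_id = 1234
  AND deleted_at IS NULL
  AND last_message_at > now() - interval '6 months'
ORDER BY last_message_id DESC
LIMIT 21;
```

Note that a BRIN index can only prune block ranges; it cannot help with the ORDER BY itself.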

What is the best practice to speed up such queries?

  • If you run the EXPLAIN using only WHERE deleted_at IS NULL, do you see the anticipated speeds? If so, I'd suggest putting both the user_id and deleted_at columns in the same index. This is usually required because you can't merge two individual indexes the way you'd imagine, but an index over multiple columns produces the faster query times you're expecting. Commented Jan 11, 2017 at 23:59
  • You say the index on deleted_at isn't used, but your EXPLAIN shows an index is: there's no seq scan, it's a backward index scan on dialogs_last_message_id. What's wrong? Paste the full query plan. Commented Jan 12, 2017 at 0:13
  • Please post your index definitions too. What do you mean by "Partial index won't help either"? An index on user_id where deleted_at IS NULL should help. Commented Jan 12, 2017 at 10:36
  • @EvanCarroll - I guess it's the user_id index being used. The filter on deleted_at just loops over the result set and dumb-compares deleted_at with NULL until there are 11 items in the result. Commented Jan 12, 2017 at 12:12
  • Start by creating a partial index on (user_id, last_message_id) with the condition WHERE deleted_at IS NULL. Commented Jan 12, 2017 at 13:35

2 Answers


As posted in comments:

Start by creating a partial index on (user_id, last_message_id) with the condition WHERE deleted_at IS NULL.
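
In DDL form (the index name is illustrative):

```sql
-- Partial composite index: one backward index scan per user, already
-- sorted by last_message_id, already excluding soft-deleted rows
CREATE INDEX dialogs_user_id_last_message_id_partial
    ON dialogs (user_id, last_message_id)
    WHERE deleted_at IS NULL;
```

With this, the planner can walk the index backward for a single user_id and satisfy both the WHERE clause and ORDER BY last_message_id DESC without discarding rows.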

Per your answer, this seems to be quite effective :-)



So, here are the results of the solutions I tried:

1) The index (user_id) WHERE deleted_at IS NULL is used only in rare cases, depending on the particular value in the WHERE user_id = ? condition. Most of the time the query has to filter out rows as before.

2) The greatest speedup was achieved with the (user_id, last_message_id) WHERE deleted_at IS NULL index. While it's 2.5x bigger than the other tested indexes, it's used every time and is very fast. Here's the resulting query plan:

Limit  (cost=1.72..270.45 rows=11 width=1308) (actual time=0.105..0.468 rows=8 loops=1)
   ->  Nested Loop Left Join  (cost=1.72..247038.21 rows=10112 width=1308) (actual time=0.104..0.465 rows=8 loops=1)
         ->  Nested Loop  (cost=1.29..164532.13 rows=10112 width=1072) (actual time=0.071..0.293 rows=8 loops=1)
               ->  Nested Loop Left Join  (cost=0.86..116292.45 rows=10112 width=736) (actual time=0.057..0.198 rows=8 loops=1)
                     ->  Index Scan Backward using dialogs_user_id_last_message_id_d57b320e on dialogs  (cost=0.43..38842.21 rows=10112 width=102) (actual time=0.038..0.084 rows=8 loops=1)
                           Index Cond: (user_id = 9069)
                     ->  Index Scan using products_pkey on products  (cost=0.43..7.65 rows=1 width=634) (actual time=0.012..0.012 rows=1 loops=8)
                           Index Cond: (dialogs.product_id = id)
               ->  Index Scan using auth_user_pkey on auth_user t4  (cost=0.42..4.76 rows=1 width=336) (actual time=0.010..0.010 rows=1 loops=8)
                     Index Cond: (id = dialogs.participant_id)
         ->  Index Scan using messages_pkey on messages  (cost=0.44..8.15 rows=1 width=236) (actual time=0.019..0.020 rows=1 loops=8)
               Index Cond: (dialogs.last_message_id = id)
 Total runtime: 0.678 ms

Thanks @jcaron. Your suggestion should be the accepted answer.
