Postgres: Performance Issue: Query on enormous data fails to use index

Question

This is the schema of the task_statuses table:

                         Table "public.task_statuses"
     Column     |            Type             | Collation | Nullable | Default 
----------------+-----------------------------+-----------+----------+---------
 id             | uuid                        |           | not null | 
 updated_at     | timestamp without time zone |           | not null | 
 status         | task_status                 |           | not null | 
 status_details | json                        |           |          | 
Indexes:
    "task_statuses_id_updated_at_key" UNIQUE CONSTRAINT, btree (id, updated_at)
    "idx_task_statuses_id_updated_at" btree (id, updated_at)
Foreign-key constraints:
    "task_statuses_id_fkey" FOREIGN KEY (id) REFERENCES tasks(id) ON DELETE CASCADE

The table, however, is huge and holds 150 GB of data in production. I am trying to run an extremely simple query:

    SELECT ts.id
    FROM task_statuses ts
    WHERE ts.status IN ('succeeded', 'failed', 'cancelled')
    ORDER BY ts.id, ts.updated_at DESC
    LIMIT 1000

It keeps timing out in production. When I remove the ORDER BY, the query runs successfully. Since I have an index on id and updated_at, I am not sure why the ORDER BY version is timing out.

EXPLAIN ANALYSE times out as well.

Here is the EXPLAIN output for the above query:

Limit  (cost=10651159.84..10651276.51 rows=1000 width=24)
  ->  Gather Merge  (cost=10651159.84..10744721.60 rows=801902 width=24)
        Workers Planned: 2
        ->  Sort  (cost=10650159.81..10651162.19 rows=400951 width=24)
              Sort Key: id, updated_at DESC
              ->  Parallel Seq Scan on task_statuses ts  (cost=0.00..10628176.10 rows=400951 width=24)
                    Filter: (status = ANY ('{succeeded,failed,cancelled}'::task_status[]))

Query plan without order by:

https://explain.depesz.com/s/CfIU

Suggestions or help would be much appreciated.

J.D. · Accepted Answer · 2020-12-29 13:21:19Z

2

Most of your cost comes from the WHERE predicate on ts.status. You can see in the EXPLAIN output that it does a Parallel Seq Scan (estimated at 400,951 rows) with a cost of 10,628,176.10.

While an index on the ORDER BY columns can help with sorting, you should generally focus first on indexing for your predicates (JOIN, WHERE, and HAVING clauses), because the planner can then use an index scan, or even a seek, instead of a Sequential Scan.

In this case, if you had an index on the status column, your performance would likely be better (regardless of the sorting in your ORDER BY clause).

The difference in performance you're currently seeing is probably a difference in query plan: the plan chosen without the ORDER BY clause happens to be more efficient overall. If you ran an EXPLAIN for the query without the ORDER BY clause, I'm sure you'd see different operations occurring. But again, proper indexing on the status field should give you consistent performance either way.
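
As a rough sketch of the suggestion above, the index and a follow-up plan check might look like the following. The index name `idx_task_statuses_status` is hypothetical, and `CONCURRENTLY` is an assumption added here so that writes are not blocked while the index builds on a large table (it cannot be run inside a transaction block).

```sql
-- Hypothetical index name. CONCURRENTLY is an added assumption,
-- chosen to avoid blocking writes during the build on a large table.
CREATE INDEX CONCURRENTLY idx_task_statuses_status
    ON task_statuses (status);

-- Re-check the plan. Plain EXPLAIN only plans the query; it does not run it.
EXPLAIN
SELECT ts.id
FROM task_statuses ts
WHERE ts.status IN ('succeeded', 'failed', 'cancelled')
ORDER BY ts.id, ts.updated_at DESC
LIMIT 1000;
```

Whether the planner actually chooses this index depends on how selective those three status values are, so the EXPLAIN output is the thing to check.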

edited Dec 29, 2020 at 13:21

answered Dec 29, 2020 at 13:15

J.D.


Ironically, removing ORDER BY made postgres use the index. explain.depesz.com/s/CfIU

– Surya, Dec 29, 2020 at 13:23

@Surya Yes, that is why it is more efficient in that case, but it is still not using the best index: the index it uses is on (id, updated_at), while you are filtering on the status field. You should create an index on status and verify that EXPLAIN shows the new index being used in the query plan, regardless of whether you use ORDER BY. In theory, that should be more performant for you in both cases.

– J.D., Dec 29, 2020 at 13:30

jjanes · Accepted Answer · 2020-12-30 02:37:21Z

1

Your index (well, both of them; it is not clear why you have two that differ only in UNIQUE) is on (id, updated_at), but your ORDER BY is on id, updated_at DESC.

PostgreSQL can follow an index forward, and follow it backwards, but it won't follow one inside out and sideways. Make an index on (id, updated_at DESC) if you want to have the best hope of following the index to obtain an ordering.

BUT in PostgreSQL v13 and later, there is an "incremental sort", which could use the existing index to get rows in order by id, then re-sort within each group of ties by id to put them in order by updated_at DESC. If each group of ties by id is small, this could be pretty efficient.
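
A minimal sketch of both options, under stated assumptions: the index name `idx_task_statuses_id_updated_at_desc` is hypothetical, and `CONCURRENTLY` is added here only to avoid blocking writes during the build. The `enable_incremental_sort` setting exists from v13 onward and is on by default, so the `SHOW` is just a sanity check.

```sql
-- Option 1: an index whose ordering matches ORDER BY id, updated_at DESC.
-- Hypothetical index name; CONCURRENTLY is an added assumption.
CREATE INDEX CONCURRENTLY idx_task_statuses_id_updated_at_desc
    ON task_statuses (id, updated_at DESC);

-- Option 2 (v13 and later): confirm incremental sort is enabled.
SHOW enable_incremental_sort;

-- In either case, inspect the plan. With option 2, look for an
-- "Incremental Sort" node with "Presorted Key: id" above an index scan.
EXPLAIN
SELECT ts.id
FROM task_statuses ts
WHERE ts.status IN ('succeeded', 'failed', 'cancelled')
ORDER BY ts.id, ts.updated_at DESC
LIMIT 1000;
```

The planner is still free to prefer the sequential scan and full sort if it estimates that to be cheaper, so neither option guarantees the index-ordered plan.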

answered Dec 30, 2020 at 2:37

jjanes


That makes sense; I will try it out.

– Surya, Mar 16, 2021 at 13:49

Also, thanks for noticing the duplicate index. Fixing that as well.

– Surya, Mar 16, 2021 at 13:49
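
For the duplicate index mentioned in this thread, one possible cleanup is sketched below. It assumes the non-unique index is the one to drop, since the index behind the UNIQUE constraint covers the same columns (id, updated_at) and is still needed to enforce uniqueness. `CONCURRENTLY` is an added assumption to avoid blocking other sessions.

```sql
-- Drops only the redundant non-unique index; the index backing the
-- UNIQUE constraint task_statuses_id_updated_at_key is left in place.
DROP INDEX CONCURRENTLY idx_task_statuses_id_updated_at;
```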
