Why this query does not trigger partition pruning in Postgres?

Question

I just noticed Postgres (checked on version 13 and 14) behavior that surprised me. I have a simple table volume with id and unique text column name. Second table dir has 3 columns: id, volume_id and path. This table is partitioned by hash on volume_id column. Here is full table schema with sample data:

CREATE TABLE dir (
   id BIGSERIAL,
   volume_id BIGINT,
   path TEXT
) PARTITION BY HASH (volume_id);

CREATE TABLE dir_0
    PARTITION OF dir FOR VALUES WITH (modulus 3, remainder 0);
CREATE TABLE dir_1
    PARTITION OF dir FOR VALUES WITH (modulus 3, remainder 1);
CREATE TABLE dir_2
    PARTITION OF dir FOR VALUES WITH (modulus 3, remainder 2);


CREATE TABLE volume(
    id BIGINT,
    name TEXT UNIQUE
);

INSERT INTO volume (id, name) VALUES (1, 'vol1'), (2, 'vol2'), (3, 'vol3');
INSERT INTO dir (volume_id, path) SELECT i % 3 + 1, 'path_' || i FROM generate_series(1,1000) AS i;

Now, given the volume name, I need to find all the rows from the dir table on that volume. I can do that in 2 different ways. Query #1

EXPLAIN ANALYZE 
SELECT * FROM dir AS d
    INNER JOIN volume AS v ON d.volume_id = v.id
    WHERE v.name = 'vol1';

Which produces query plan:

QUERY PLAN
Hash Join (cost=1.05..31.38 rows=333 width=37) (actual time=0.186..0.302 rows=333 loops=1)
Hash Cond: (d.volume_id = v.id)
-> Append (cost=0.00..24.00 rows=1000 width=24) (actual time=0.006..0.154 rows=1000 loops=1)
-> Seq Scan on dir_0 d_1 (cost=0.00..6.34 rows=334 width=24) (actual time=0.006..0.032 rows=334 loops=1)
-> Seq Scan on dir_1 d_2 (cost=0.00..6.33 rows=333 width=24) (actual time=0.006..0.029 rows=333 loops=1)
-> Seq Scan on dir_2 d_3 (cost=0.00..6.33 rows=333 width=24) (actual time=0.004..0.026 rows=333 loops=1)
-> Hash (cost=1.04..1.04 rows=1 width=13) (actual time=0.007..0.007 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on volume v (cost=0.00..1.04 rows=1 width=13) (actual time=0.003..0.004 rows=1 loops=1)
Filter: (name = 'vol1'::text)
Rows Removed by Filter: 2

Planning Time: 0.500 ms
Execution Time: 0.364 ms

As you can see this query leads to a sequential scan on all 3 partitions of the dir table.

Alternatively, we can write this query like this:

Query #2

EXPLAIN ANALYZE 
SELECT * FROM dir AS d
    WHERE volume_id = (SELECT id FROM volume AS v WHERE v.name = 'vol1');

In that case we get following query plan:

QUERY PLAN
Append (cost=1.04..27.54 rows=1000 width=24) (actual time=0.010..0.066 rows=333 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on volume v (cost=0.00..1.04 rows=1 width=8) (actual time=0.003..0.004 rows=1 loops=1)
Filter: (name = 'vol1'::text)
Rows Removed by Filter: 2
-> Seq Scan on dir_0 d_1 (cost=0.00..7.17 rows=334 width=24) (never executed)
Filter: (volume_id = $0)
-> Seq Scan on dir_1 d_2 (cost=0.00..7.16 rows=333 width=24) (never executed)
Filter: (volume_id = $0)
-> Seq Scan on dir_2 d_3 (cost=0.00..7.16 rows=333 width=24) (actual time=0.004..0.037 rows=333 loops=1)
Filter: (volume_id = $0)
Planning Time: 0.063 ms
Execution Time: 0.093 ms

Here we can see that partitions dir_0 and dir_1 have never executed annotation.

View on DB Fiddle

My question is: Why in the first case is there no partition pruning? Postgres already knows that the volume.name column is unique and that it will translate into a single volume_id. I would like to get a good intuition on when partition pruning can happen during query execution.

Laurenz Albe · Accepted Answer · 2022-05-27 12:13:50Z

0

To get partition pruning with a hash join, you'd need to add a condition on d.volume_id to your query. No inference is made from the join with volume.

Your second query shows partition pruning; the "never executed" means that the query executor pruned the scan of certain partitions.

An alternative method that forces a nested loop join and prunes partitions should be

SELECT *
FROM volume AS v
   CROSS JOIN LATERAL (SELECT * FROM dir
                       WHERE dir.volume_id = v.id
                       OFFSET 0) AS d
WHERE v.name = 'vol1';

OFFSET 0 prevents anything except a nested loop join. Different from your query, that should also work if volumn.name is not unique.

edited May 27, 2022 at 12:13

answered May 27, 2022 at 11:00

Laurenz Albe

257k22 gold badges312 silver badges388 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Michał Albrycht Over a year ago

I guess that the main takeaway is that there must be a condition on the partition column. Unfortunately, your proposition with lateral join doesn't work (at least for that example) and produces hash join with scan on all partitions.

Michał Albrycht Over a year ago

Do you know if there are any plans for making such queries use partition pruning?

Laurenz Albe Over a year ago

Ah. Reading your question again, I see that your second query already does get partition pruning, so you should be good. I am surprised that my query doesn't do the same thing; could you show EXPLAIN (ANALYZE) output for my query?

Michał Albrycht Over a year ago

Here you go: explain.depesz.com/s/PsIi

Michał Albrycht Over a year ago

Here is a link to DB Fiddle if you want to play with it: db-fiddle.com/f/swihQpFqUHhBujfoLK9JNe/4

|

Collectives™ on Stack Overflow

Why this query does not trigger partition pruning in Postgres?

1 Answer 1

6 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Your Answer

Sign up or log in

Post as a guest

Related