0

I just noticed Postgres (checked on version 13 and 14) behavior that surprised me. I have a simple table volume with id and unique text column name. Second table dir has 3 columns: id, volume_id and path. This table is partitioned by hash on volume_id column. Here is full table schema with sample data:

CREATE TABLE dir (
   id BIGSERIAL,
   volume_id BIGINT,
   path TEXT
) PARTITION BY HASH (volume_id);

CREATE TABLE dir_0
    PARTITION OF dir FOR VALUES WITH (modulus 3, remainder 0);
CREATE TABLE dir_1
    PARTITION OF dir FOR VALUES WITH (modulus 3, remainder 1);
CREATE TABLE dir_2
    PARTITION OF dir FOR VALUES WITH (modulus 3, remainder 2);


CREATE TABLE volume(
    id BIGINT,
    name TEXT UNIQUE
);

INSERT INTO volume (id, name) VALUES (1, 'vol1'), (2, 'vol2'), (3, 'vol3');
INSERT INTO dir (volume_id, path) SELECT i % 3 + 1, 'path_' || i FROM generate_series(1,1000) AS i;

Now, given the volume name, I need to find all the rows from the dir table on that volume. I can do that in 2 different ways. Query #1

EXPLAIN ANALYZE 
SELECT * FROM dir AS d
    INNER JOIN volume AS v ON d.volume_id = v.id
    WHERE v.name = 'vol1';

Which produces query plan:

QUERY PLAN
Hash Join (cost=1.05..31.38 rows=333 width=37) (actual time=0.186..0.302 rows=333 loops=1)
Hash Cond: (d.volume_id = v.id)
-> Append (cost=0.00..24.00 rows=1000 width=24) (actual time=0.006..0.154 rows=1000 loops=1)
-> Seq Scan on dir_0 d_1 (cost=0.00..6.34 rows=334 width=24) (actual time=0.006..0.032 rows=334 loops=1)
-> Seq Scan on dir_1 d_2 (cost=0.00..6.33 rows=333 width=24) (actual time=0.006..0.029 rows=333 loops=1)
-> Seq Scan on dir_2 d_3 (cost=0.00..6.33 rows=333 width=24) (actual time=0.004..0.026 rows=333 loops=1)
-> Hash (cost=1.04..1.04 rows=1 width=13) (actual time=0.007..0.007 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on volume v (cost=0.00..1.04 rows=1 width=13) (actual time=0.003..0.004 rows=1 loops=1)
Filter: (name = 'vol1'::text)
Rows Removed by Filter: 2
Planning Time: 0.500 ms
Execution Time: 0.364 ms

As you can see this query leads to a sequential scan on all 3 partitions of the dir table.


Alternatively, we can write this query like this:

Query #2

EXPLAIN ANALYZE 
SELECT * FROM dir AS d
    WHERE volume_id = (SELECT id FROM volume AS v WHERE v.name = 'vol1');

In that case we get following query plan:

QUERY PLAN
Append (cost=1.04..27.54 rows=1000 width=24) (actual time=0.010..0.066 rows=333 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on volume v (cost=0.00..1.04 rows=1 width=8) (actual time=0.003..0.004 rows=1 loops=1)
Filter: (name = 'vol1'::text)
Rows Removed by Filter: 2
-> Seq Scan on dir_0 d_1 (cost=0.00..7.17 rows=334 width=24) (never executed)
Filter: (volume_id = $0)
-> Seq Scan on dir_1 d_2 (cost=0.00..7.16 rows=333 width=24) (never executed)
Filter: (volume_id = $0)
-> Seq Scan on dir_2 d_3 (cost=0.00..7.16 rows=333 width=24) (actual time=0.004..0.037 rows=333 loops=1)
Filter: (volume_id = $0)
Planning Time: 0.063 ms
Execution Time: 0.093 ms

Here we can see that partitions dir_0 and dir_1 have never executed annotation.


View on DB Fiddle

My question is: Why in the first case is there no partition pruning? Postgres already knows that the volume.name column is unique and that it will translate into a single volume_id. I would like to get a good intuition on when partition pruning can happen during query execution.

1 Answer 1

0

To get partition pruning with a hash join, you'd need to add a condition on d.volume_id to your query. No inference is made from the join with volume.

Your second query shows partition pruning; the "never executed" means that the query executor pruned the scan of certain partitions.

An alternative method that forces a nested loop join and prunes partitions should be

SELECT *
FROM volume AS v
   CROSS JOIN LATERAL (SELECT * FROM dir
                       WHERE dir.volume_id = v.id
                       OFFSET 0) AS d
WHERE v.name = 'vol1';

OFFSET 0 prevents anything except a nested loop join. Different from your query, that should also work if volumn.name is not unique.

Sign up to request clarification or add additional context in comments.

6 Comments

I guess that the main takeaway is that there must be a condition on the partition column. Unfortunately, your proposition with lateral join doesn't work (at least for that example) and produces hash join with scan on all partitions.
Do you know if there are any plans for making such queries use partition pruning?
Ah. Reading your question again, I see that your second query already does get partition pruning, so you should be good. I am surprised that my query doesn't do the same thing; could you show EXPLAIN (ANALYZE) output for my query?
Here is a link to DB Fiddle if you want to play with it: db-fiddle.com/f/swihQpFqUHhBujfoLK9JNe/4
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.