I'm testing how joins work with a hash index in PostgreSQL 16.2. Here is the test table: just two columns holding numbers in text format.
create table join_test (
    pk varchar(20),
    fk varchar(20)
);
insert into join_test (pk, fk)
select s::varchar(20),
       (10000001 - s)::varchar(20)
from generate_series(1, 10000000) as s;
Then I run a simple self-join.
explain analyze select *
from join_test t1
join join_test t2 on t1.pk = t2.fk;
Hash Join (cost=327879.85..878413.95 rows=9999860 width=28) (actual time=5181.056..18337.596 rows=10000000 loops=1)
  Hash Cond: ((t1.pk)::text = (t2.fk)::text)
  -> Seq Scan on join_test t1 (cost=0.00..154053.60 rows=9999860 width=14) (actual time=0.070..1643.618 rows=10000000 loops=1)
  -> Hash (cost=154053.60..154053.60 rows=9999860 width=14) (actual time=5147.801..5147.803 rows=10000000 loops=1)
        Buckets: 262144 Batches: 128 Memory Usage: 5691kB
        -> Seq Scan on join_test t2 (cost=0.00..154053.60 rows=9999860 width=14) (actual time=0.024..2163.714 rows=10000000 loops=1)
Planning Time: 0.172 ms
Execution Time: 18718.586 ms
No surprises here: without indexes, the planner picks a Hash Join and builds the hash table on the fly.
Then I add a hash index on the "pk" column and run the same join again.
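The index statement itself is not shown here. A minimal sketch of what it presumably looked like follows; the name join_test_pk_idx is taken from the plan in this post and is also the name PostgreSQL generates by default for an index on join_test(pk), but the exact statement is an assumption.

```sql
-- Assumed reconstruction: hash index on pk, named as it appears in the plan.
create index join_test_pk_idx on join_test using hash (pk);

-- Same join as before, re-run with the index in place.
explain analyze select *
from join_test t1
join join_test t2 on t1.pk = t2.fk;
```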
Nested Loop (cost=0.00..776349.75 rows=9999860 width=28) (actual time=0.107..85991.520 rows=10000000 loops=1)
  -> Seq Scan on join_test t2 (cost=0.00..154053.60 rows=9999860 width=14) (actual time=0.062..1399.400 rows=10000000 loops=1)
  -> Index Scan using join_test_pk_idx on join_test t1 (cost=0.00..0.05 rows=1 width=14) (actual time=0.008..0.008 rows=1 loops=10000000)
        Index Cond: ((pk)::text = (t2.fk)::text)
        Rows Removed by Index Recheck: 0
Planning Time: 0.195 ms
Execution Time: 86490.687 ms
As I understand it, in this particular case the Nested Loop is not really a "nested loop" but is algorithmically the same as a Hash Join, except that it probes the hash structure already stored in the index instead of building a hash table during the query.
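One way to test this reading, as a sketch only: keep the hash index, discourage Nested Loop for a single transaction so the planner falls back to the Hash Join, and compare the two plans including buffer statistics. enable_nestloop and the BUFFERS option are standard PostgreSQL features; no output is shown because this is not something measured in this post.

```sql
-- Sketch: compare both join strategies with the hash index still present.
begin;
set local enable_nestloop = off;  -- applies to this transaction only
explain (analyze, buffers)
select *
from join_test t1
join join_test t2 on t1.pk = t2.fk;
rollback;
```

The BUFFERS numbers would show how many pages each plan touches, which is one place where 10,000,000 separate index probes could differ from a single in-query hash build.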
- In theory, the query should run faster with the index than without it. In practice it is about 4.6 times slower (86.5 s versus 18.7 s). Why?
- Does it even make sense to use a hash index for a join, or should I always use a btree?
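For the second question, a like-for-like comparison could be set up roughly as follows. This is a sketch under assumptions: the index name join_test_pk_btree is hypothetical, and the hash index is dropped first so that the planner can only consider the btree.

```sql
-- Sketch: replace the hash index with a btree on the same column.
drop index if exists join_test_pk_idx;
create index join_test_pk_btree on join_test using btree (pk);  -- hypothetical name
analyze join_test;  -- refresh planner statistics after the change

explain (analyze, buffers)
select *
from join_test t1
join join_test t2 on t1.pk = t2.fk;
```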