Postgres doing a sort on simple join

Question

I have two tables in my database (address and person_address). Address has a PK in address_id. person_address has a PK on (address_id, person_id, usage_code)

When joining this two tables through the address_id, my expectation is that the PK index is used on both cases. However, Postgres is adding sort and materialize steps to the plan, which slows down the execution of the query. I have tried dropping indexes (person_address had an index on address_id), analyzing stats, without success.

I will appreciate any help on how to isolate this situation since those queries run slower than expected on our production environment

This is the query:

select * 
from person_addresses pa
join address a
on pa.address_id = a.address_id

This is the plan :

Merge Join  (cost=1506935.96..2416648.39 rows=16033774 width=338)
  Merge Cond: (pa.address_id = ((a.address_id)::numeric))
  ->  Index Scan using person_addresses_pkey on person_addresses pa  (cost=0.43..592822.76 rows=5256374 width=104)
  ->  Materialize  (cost=1506935.53..1526969.90 rows=4006874 width=234)
        ->  Sort  (cost=1506935.53..1516952.71 rows=4006874 width=234)
              Sort Key: ((a.address_id)::numeric)
              ->  Seq Scan on address a  (cost=0.00..163604.74 rows=4006874 width=234)

Thanks.

Edit 1. After the comment checked the data types and found a discrepancy. Fixing the data type changed the plan to the following

Hash Join  (cost=343467.18..881125.47 rows=5256374 width=348)
  Hash Cond: (pa.address_id = a.address_id)
  ->  Seq Scan on person_addresses pa  (cost=0.00..147477.74 rows=5256374 width=104)
  ->  Hash  (cost=159113.97..159113.97 rows=4033697 width=244)
        ->  Seq Scan on address_normalization a  (cost=0.00..159113.97 rows=4033697 width=244)

Performance improvement is evident on the plan, but am wondering if the sequential scans are expected without any filters

Column in person_addresses was numeric(15), on address it was bigint. I have changed address.address_id to be numeric(15). Now the plan looks like this Hash Join (cost=343467.18..881125.47 rows=5256374 width=348) Hash Cond: (pa.address_id = a.address_id) -> Seq Scan on person_addresses pa (cost=0.00..147477.74 rows=5256374 width=104) -> Hash (cost=159113.97..159113.97 rows=4033697 width=244) -> Seq Scan on address_normalization a (cost=0.00..159113.97 rows=4033697 width=244) Is a seq scan normal on these cases? Plan is more efficient it seems — Cesar Bolanos
– Cesar Bolanos, Commented Jul 26, 2018 at 16:08
"if the sequential scans are expected without any filters" - absolutely. If you want all rows, a Seq Scan is the obvious and fastest way to get them. The reason why the Merge Join was chose, was because the two data types wouldn't result in the same hash values and thus a Hash Join wasn't possible. — user330315
– user330315, Commented Jul 26, 2018 at 19:04

IVO GELOV · Accepted Answer · 2018-07-27 08:39:38Z

So there are two questions here:

why did Postgres choose the (expensive) "Merge Join" in the first query?

The reason for this is that it could not use the more efficient "Hash Join" because the hash values of integer and numeric values would be different. But the Merge join requires that the values are sorted, and that's where the "Sort" step comes from in the first execution plan. Given the number of rows a "Nested Loop" would have been even more expensive.

The second question is:

I am wondering if the sequential scans are expected without any filters

Yes they are expected. The query retrieves all matching rows from both tables and that is done most efficiently by scanning all rows. An index scan requires about 2-3 I/O operations per row that has to be retrieved. A sequential scan usually requires less than one I/O operation as one block (which is the smallest unit the database reads from the disk) contains multiple rows.

You can run explain (analyze, buffers) to see how much "logical reads" each step takes.

Collectives™ on Stack Overflow

Postgres doing a sort on simple join

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related