Can anyone explain why PostgreSQL works so:
If I execute this query
SELECT
*
FROM project_archive_doc as PAD, project_archive_doc as PAD2
WHERE
PAD.id = PAD2.id
it will be simple JOIN and EXPLAIN will looks like this:
Hash Join (cost=6.85..13.91 rows=171 width=150)
Hash Cond: (pad.id = pad2.id)
-> Seq Scan on project_archive_doc pad (cost=0.00..4.71 rows=171 width=75)
-> Hash (cost=4.71..4.71 rows=171 width=75)
-> Seq Scan on project_archive_doc pad2 (cost=0.00..4.71 rows=171 width=75)
But if I will execute this query:
SELECT *
FROM project_archive_doc as PAD
WHERE
PAD.id = (
SELECT PAD2.id
FROM project_archive_doc as PAD2
WHERE
PAD2.project_id = PAD.project_id
ORDER BY PAD2.created_at
LIMIT 1)
there will be no joins and EXPLAIN looks like:
Seq Scan on project_archive_doc pad (cost=0.00..886.22 rows=1 width=75)"
Filter: (id = (SubPlan 1))
SubPlan 1
-> Limit (cost=5.15..5.15 rows=1 width=8)
-> Sort (cost=5.15..5.15 rows=1 width=8)
Sort Key: pad2.created_at
-> Seq Scan on project_archive_doc pad2 (cost=0.00..5.14 rows=1 width=8)
Filter: (project_id = pad.project_id)
Why it is so and is there any documentation or articles about this?
PAD2.project_id = PAD.project_idPLUS theLIMIT 1make the optimiser believe that very few rows will satisfy the condition (estimate=one row), so it chooses to extract exaxtly this row. Things may change if the table gets larger. That is: the plan could change, the result will always be one row. The optimiser is (mostly) always right...LIMIT 1inside a (scalar correlated) subquery looks awkward, IMHO)