
I have two postgres databases (one for development, one for test). Both have the same structure.

I'm running this query on both (it moves every row from one table into another):

WITH moved_rows AS 
( 
    DELETE FROM mybigtable_20220322 
    RETURNING * 
) 
INSERT INTO mybigtable_202203 
SELECT * FROM moved_rows; 

But I get slightly different results for EXPLAIN on each version.

Development (Postgres 13.1) -

Insert on mybigtable_202203  (cost=363938.39..545791.17 rows=9092639 width=429)
  CTE moved_rows
    ->  Delete on mybigtable_20220322  (cost=0.00..363938.39 rows=9092639 width=6)
          ->  Seq Scan on mybigtable_20220322  (cost=0.00..363938.39 rows=9092639 width=6)
  ->  CTE Scan on moved_rows  (cost=0.00..181852.78 rows=9092639 width=429)

Test (Postgres 14.1) -

Insert on mybigtable_202203  (cost=372561.91..558377.73 rows=0 width=0)
  CTE moved_rows
    ->  Delete on mybigtable_20220322  (cost=0.00..372561.91 rows=9290791 width=6)
          ->  Seq Scan on mybigtable_20220322  (cost=0.00..372561.91 rows=9290791 width=6)
  ->  CTE Scan on moved_rows  (cost=0.00..185815.82 rows=9290791 width=429)

The big difference is in the first line: on Development I get rows=9092639 width=429, while on Test I get rows=0 width=0.

All the tables have the same definitions, with the same indexes (not that they seem to be used). The query succeeds on both databases, EXPLAIN indicates similar costs on both, and the tables in each database have a similar record count (just over 9 million rows).

In practice the difference is that on Development the query takes a few minutes, while on Test it takes a few hours.

Both databases were created with the same scripts, so they should be 100% identical. My guess is that some small, subtle difference has crept in somewhere. Any suggestion on what the difference might be, or how to find it? Thanks

Update

  • Both of the tables being merged (on both databases) have been VACUUM ANALYZEd in similar timeframes.
  • I used fc to compare both DBs. There was ONE difference: on the development database the table was clustered on one of the indexes. I applied the same clustering to the test table, but the results didn't change.
  • In response to the comment 'the plans are the same, only the estimated rows are different': this difference is the only clue I currently have to an underlying problem. My development database is on a 10-year-old server struggling with a lack of resources; my test database is on a brand-new server. The former takes a few minutes to actually run the query, the latter takes a few hours. Whenever I post a question on the forum I'm always told 'start with the explain plan'.
  • Run VACUUM ANALYZE; on the test db to get some usable statistics. Commented Mar 25, 2022 at 12:47
  • If you want to compare the two data models: pg_dump --schema-only ... >outputXY on both databases, and diff the resulting outputX and outputY Commented Mar 25, 2022 at 13:01
  • If the databases have different volumes of rows or different data distributions, it's normal to get different execution plans, even if the stats are up to date (using VACUUM). Commented Mar 25, 2022 at 13:09
  • If you have identical query plans but different execution times, there could be a difference in (absence of) indexes, triggers, or deferred constraints. Commented Mar 25, 2022 at 13:12
  • Having started with the explain plan, we can quickly see that the differences are merely cosmetic (for the reason pinpointed by Laurenz). No one promised you that you can end with the explain plan. A good next step would be to sample the wait_event column of pg_stat_activity while the slow one is running, and see if there is a predominant wait event. Commented Mar 25, 2022 at 15:57
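The wait-event sampling suggested in the last comment can be done with a query along these lines (a sketch, not from the thread; run it repeatedly in a second session while the slow statement is active):

```sql
-- Sample what the active backends are waiting on.
-- Repeat every few seconds and look for a predominant wait_event.
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS running_for,
       left(query, 60)     AS query_snippet
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid();
```

If one wait event (e.g. an IO- or Lock-type event) dominates the samples for the slow backend, that points at the real bottleneck far more directly than the cost estimates do.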

2 Answers


This change was made with commit f0f13a3a08b27, which didn't make it into the release notes, since it was considered a bug fix:

Fix estimates for ModifyTable paths without RETURNING.

In the past, we always estimated that a ModifyTable node would emit the same number of rows as its subpaths. Without a RETURNING clause, the correct estimate is zero. Fix, in preparation for a proposed parallel write patch that is sensitive to that number.

A remaining problem is that for RETURNING queries, the estimated width is based on subpath output rather than the RETURNING tlist.

Reviewed-by: Greg Nancarrow [email protected]
Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.gmail.com

This change only affects EXPLAIN output, not what actually happens during data modifications.
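You can see that only the estimate is affected by comparing the plan with and without a RETURNING clause on the outer INSERT (a sketch using the tables from the question; on v14 and later, the top-level Insert node should report rows=0 without RETURNING and a nonzero estimate with it):

```sql
EXPLAIN
WITH moved_rows AS
(
    DELETE FROM mybigtable_20220322
    RETURNING *
)
INSERT INTO mybigtable_202203
SELECT * FROM moved_rows
RETURNING *;  -- the Insert node now emits rows, so its estimate is nonzero
```

Either way, the same rows are deleted and inserted; the planner's row count on the ModifyTable node is purely cosmetic here.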




This addresses the different query times you got in these two scenarios.

It is useful to consider the following things when dealing with queries like yours (a bulk delete-and-move):

  • the existence of triggers can make your query take much longer.
  • the existence of indexes will make the query take longer.
  • foreign keys referencing your table can make the query take much, much longer.

It is best to drop all of those things before starting a query like that. In my case the table had a foreign key referencing the table itself (a response-kind column). After dropping it, the query went from over 4 hours to under 30 seconds.
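A sketch of that approach, wrapping the move from the question; the constraint and column names (fk_response, response_id, id) are hypothetical, not from the answer:

```sql
BEGIN;

-- Drop the self-referencing FK so each deleted row is not
-- checked against the constraint. (hypothetical constraint name)
ALTER TABLE mybigtable_20220322 DROP CONSTRAINT fk_response;

WITH moved_rows AS
(
    DELETE FROM mybigtable_20220322
    RETURNING *
)
INSERT INTO mybigtable_202203
SELECT * FROM moved_rows;

-- Recreate the constraint afterwards. NOT VALID skips re-checking
-- existing rows now; VALIDATE CONSTRAINT can be run later.
ALTER TABLE mybigtable_20220322
    ADD CONSTRAINT fk_response FOREIGN KEY (response_id)
    REFERENCES mybigtable_20220322 (id) NOT VALID;

COMMIT;
```

Doing the drop and re-add in one transaction keeps other sessions from seeing the table without its constraint, at the cost of holding a lock for the duration of the move.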

