Postgres row_number return arbitrary sort order [duplicate]

Question

I run

SELECT * 
FROM 
    (SELECT 
         *, 
         ROW_NUMBER() OVER () AS n 
     FROM 
         {table_name}) t 
WHERE 
    n < 10000

in Postgres. I've noticed the result is different for each run.

To test whether the content differs in addition to the order, I computed avg on a column. The result is interesting: the table with a primary key returns a consistent value, while another table without a primary key returns a different value on each run.

The execution plan for the table with a PK:

"Aggregate  (cost=139391585.22..139391585.23 rows=1 width=32)"
"  ->  WindowAgg  (cost=0.58..99288350.02 rows=3208258816 width=9090)"
"        Run Condition: (row_number() OVER (?) < 10000)"
"        ->  Index Only Scan using mea_vit_pi_4221ef4deeadcabf_ix on {table_name}  (cost=0.58..59185114.82 rows=3208258816 width=8)"

The execution plan for the table without a PK:

"Aggregate  (cost=83580303.64..83580303.65 rows=1 width=32)"
"  ->  WindowAgg  (cost=0.00..61837074.84 rows=1739458304 width=650)"
"        Run Condition: (row_number() OVER (?) < 10000)"
"        ->  Seq Scan on {table_2}  (cost=0.00..40093846.04 rows=1739458304 width=8)"

Why is it different? If possible, what should I do to get a stable result?

I ran SELECT avg(person_id) FROM (SELECT *, ROW_NUMBER() OVER () AS n FROM {table_name}) t WHERE n < 10000 to test whether the content actually differs in addition to the order. The result is interesting: the content seems to be the same for the table with a primary key, but different for the table without one.
– willshen, Commented Feb 14, 2024 at 0:18
Also, ROW_NUMBER() without ORDER BY in the OVER clause is forbidden in standard ISO SQL because of inconsistent results. This is a gotcha in PostgreSQL, which does not conform to the SQL standard while it claims to!
– SQLpro, Commented Feb 14, 2024 at 11:19

Erwin Brandstetter · Accepted Answer · 2024-02-14 01:10:00Z

ROW_NUMBER() OVER () (without ORDER BY) returns sequential numbers in arbitrary sort order, in whatever sequence Postgres happens to return rows. In your case, not only can the same row get a different number, different rows can be selected each time.

For small tables, the order typically follows the physical order of rows, but that is unreliable! For bigger tables or more complex queries, a number of distortions kick in: parallelism, caching effects, ...

In your particular test, Index Only Scan (on your PK index, presumably) vs. Seq Scan makes the difference. The index-only scan returns sorted rows. A sequential scan is free to grab the cheapest rows it can get in arbitrary order.
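
To see which scan a given table gets, the test query from the question can be prefixed with EXPLAIN. A minimal sketch, with tbl standing in for the real table name and person_id taken from the question. The SET statement at the end rests on an assumption about one possible cause, which the plans above do not confirm: synchronized sequential scans (setting synchronize_seqscans, on by default) allow a scan of a large table to start somewhere in the middle and wrap around.

```sql
-- Show the plan without executing the query; look for
-- "Index Only Scan" vs. "Seq Scan" below the WindowAgg node.
EXPLAIN
SELECT avg(person_id)
FROM  (SELECT *, row_number() OVER () AS n FROM tbl) t
WHERE  n < 10000;

-- Diagnostic only, for the current session (assumption: synchronized
-- sequential scans contribute to the varying results).
-- With this off, a sequential scan always starts at the beginning of the table.
SET synchronize_seqscans = off;
```

Even with that setting off, the row order is still not guaranteed without ORDER BY; it only removes one source of variation while investigating.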

Add a deterministic ORDER BY clause to get deterministic numbers. That means the ORDER BY expressions together must yield a unique value for every selected row.
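
Whether a candidate sort key is unique can be checked with a grouped count. This is a sketch, not part of the original answer: the column pair (person_id, measured_at) is hypothetical apart from person_id, which appears in the question, and tbl is a placeholder.

```sql
-- Hypothetical candidate sort key: (person_id, measured_at).
-- Every row returned is a group of ties, meaning the key alone
-- does not produce a deterministic order.
SELECT person_id, measured_at, count(*) AS ties
FROM   tbl
GROUP  BY person_id, measured_at
HAVING count(*) > 1
LIMIT  10;
```

If this returns rows, append further columns until the combination is unique. A primary key or any other unique, non-null column works as a final tiebreaker. On a table with billions of rows this check is itself expensive, so it may be worth running on a sample first.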

While you are at it, replace the WHERE clause with a cheaper LIMIT:

SELECT *, row_number() OVER () AS n
FROM   tbl
ORDER  BY  -- your deterministic sort order here!
LIMIT  9999;

row_number() follows the order set in the ORDER BY of the same SELECT. But to be absolutely unambiguous:

SELECT *, row_number() OVER (ORDER  BY ... ) AS n  -- your deterministic sort order here!
FROM   tbl
ORDER  BY  ...  -- same here!
LIMIT  9999;

As currently implemented, both queries result in the same query plan, but only the second is guaranteed to number rows in that order.
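
Applied to the averaging test from the question, a sketch might look like this, assuming the table has a unique column (called id here, which is hypothetical) to sort by:

```sql
-- id is a hypothetical unique column; substitute the real deterministic sort key.
-- person_id is the column averaged in the question.
SELECT avg(person_id)
FROM  (
   SELECT person_id, row_number() OVER (ORDER BY id) AS n
   FROM   tbl
   ) t
WHERE  n < 10000;
```

With a unique sort key inside OVER, the same rows get numbers below 10000 on every run, so the average should stay the same as long as the underlying data does not change.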
