
Let's say there is a table customers with the following columns:

  • customer_id as primary key

  • creation_date

  • a few other columns

I want to query all the entries from the customers table with creation_date >= to_date('2000/01/01', 'YYYY/MM/DD') and insert them into another table, clients. The customers table has around 10 million entries, and both tables are in the same database. I want to minimize the time taken to copy this data.

There are two approaches:

  1. Run the following query in parallel for each n, with n = 1, 10001, 20001, and so on:

    insert into clients 
    values ( 
      select * 
      from ( 
        select *, row_number() over (ORDER BY customer_id) as rn 
        from ( 
          select * 
          from customers 
          where creation_date >= to_date('2000/01/01', 'YYYY/MM/DD')
        ) as sub1
      ) as sub2  
      where rn >= n limit 10000
    );
    
  2. Run a single query:

    insert into clients 
    values ( 
      select * 
      from customers 
      where creation_date >= to_date('2000/01/01', 'YYYY/MM/DD')
    );
    

For 1, the following is the execution plan:

    Insert on clients  (cost=0.42..113.55 rows=10000 width=4506)
       ->  Subquery Scan on "*SELECT*"  (cost=0.42..113.55 rows=10000 width=4506)
             ->  Limit  (cost=0.42..112.45 rows=10000 width=616)
                   ->  Subquery Scan on sub2  (cost=0.42..57773.19 rows=1000000 width=616)
                         Filter: (sub2.rn >= 0)
                         ->  WindowAgg  (cost=0.42..57579.79 rows=1000000 width=624)
                                ->  Index Scan using "customer_id_pkey" on customers (cost=0.42..57347.71 rows=1000000 width=616)
                                      Filter: (creation_date >= to_date('2000/01/01'::text, 'YYYY/MM/DD'::text))

For each value of n run in parallel, the WindowAgg under the Subquery Scan on sub2 still has to produce every row up to rn = n + 9999 before the Limit can stop, so the later chunks re-read almost all of the data. Hence doing it in parallel has defeated the purpose.

Is my understanding correct, or will approach 1 take less time than approach 2?

Also, please suggest if there is a way to improve the query performance.

[The plan node values have been modified, so please ignore the rows, cost, and width values.]

2 Comments
  • Database size is relative. 10 million rows isn't necessarily big if your server is reasonably powerful, the load is low, or the records are small. If any of those are true, you may really be overthinking this. If none of them are true, you might want more resources for your DB server. Commented Mar 18, 2018 at 1:49
  • The inserts won't work if the query returns more than one row. If you want an insert based on a select, remove the values clause; that only allows a single row. Commented Mar 18, 2018 at 6:19

1 Answer


This is more of an extended comment.

You should really try the different approaches to see which works best. Personally, I would pretty much go with the second approach, although I would write it as:

insert into clients ( . . .)
    select c.*
    from customers c
    where creation_date >= date '2000-01-01';

Why? Doing multiple inserts into the table generates table contention, log contention, and possibly index contention. That is likely to slow things down. I'm not saying it will -- just that instead of thinking about it, I'd start the query and wait however long it takes (a cup of coffee? lunch? overnight?).
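
If you want to measure rather than guess, psql's built-in timing is enough for a rough comparison (the column list is elided here, as above):

    -- in psql: report the wall-clock time of each statement
    \timing on

    insert into clients ( . . . )
    select c.*
    from customers c
    where creation_date >= date '2000-01-01';

Run each candidate on a quiet system and compare the reported durations.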

If your data is really big, then you should probably consider partitioning it. If this is the case, then multiple simultaneous inserts are a good idea -- as long as they all go into a single partition. A rough sketch of what that could look like follows.
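
For reference, declarative range partitioning (available since PostgreSQL 10) looks roughly like this; the column definitions and partition bounds here are hypothetical:

    -- hypothetical sketch: range-partition clients by creation_date
    create table clients (
        customer_id   bigint not null,
        creation_date date   not null
        -- ... the remaining columns ...
    ) partition by range (creation_date);

    -- one partition per decade; parallel inserts that each target
    -- a different partition would then not contend with each other
    create table clients_2000s partition of clients
        for values from ('2000-01-01') to ('2010-01-01');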

In your first version, by the way, you don't need to use row_number(). You can use the offset/fetch syntax to get different pieces of the data.
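
A chunked insert using offset/fetch, as suggested, might look something like this sketch (one statement per worker, stepping the offset by 10000):

    -- sketch of one 10,000-row chunk using offset/fetch instead of row_number()
    insert into clients ( . . . )
    select c.*
    from customers c
    where creation_date >= date '2000-01-01'
    order by customer_id
    offset 0 rows fetch next 10000 rows only;  -- next chunk: offset 10000 rows, ...

Bear in mind that the server still has to read and discard every skipped row to honor a large offset, so this shares the scaling problem of the row_number() version; it only removes the window function.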


3 Comments

How is the query in approach 2 different from what you are mentioning?
@Geeta: It's not different; it's just another way to specify a date. Of course, the preferred version should be a standard SQL date literal: DATE '2000-01-01' :-)
Thanks, I have one more question here. One way to partition the data is to use offset and limit and run the inserts in parallel, but to do that I will have to get the total count of entries, which means scanning the entire table. Another way is to use an indexed column, say creation_date, and split the data across threads for parallel inserts based on it. Will this solution perform better?
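
A sketch of the idea in the last comment, assuming creation_date is indexed as the comment supposes: bound each worker's chunk by the indexed column itself, so no total count is needed and each scan stops at its own boundary (the yearly ranges here are made up):

    -- sketch: one worker's chunk, bounded by the indexed column
    insert into clients ( . . . )
    select c.*
    from customers c
    where creation_date >= date '2000-01-01'
      and creation_date <  date '2001-01-01';  -- next worker: 2001, 2002, ...

Each worker then does an index range scan over just its own slice instead of counting or skipping rows.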
