Let's say there is a table customers; it has the following columns:
customer_id as primary key
creation_date
and a few other columns
I want to query all the entries from the customers table with creation_date >= to_date('2000/01/01', 'YYYY/MM/DD') and insert them into another table, clients. The customers table has around 10 million entries, and both tables are in the same database. I want to minimize the time taken to copy this data.
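For concreteness, here is a minimal sketch of the schema I am assuming (any column beyond the two mentioned above, such as name, is hypothetical):

    create table customers (
        customer_id   bigint primary key,
        creation_date date,
        name          text   -- stand-in for the few other columns
    );

    -- clients is assumed to have the same column layout as customers
    create table clients (like customers including all);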
There are two approaches:
1. Run the following query in parallel for each n, with n = 1, 10001, 20001, and so on:
    insert into clients values (
        select *
        from (
            select *, row_number() over (order by customer_id) as rn
            from (
                select *
                from customers
                where creation_date >= to_date('2000/01/01', 'YYYY/MM/DD')
            ) as sub1
        ) as sub2
        where rn >= n
        limit 10000
    );
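(The starting offsets themselves can be generated in SQL; a small sketch, assuming roughly 10 million qualifying rows:)

    -- produces n = 1, 10001, 20001, ...
    select n from generate_series(1, 10000000, 10000) as n;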
2. Run a single query:

    insert into clients values (
        select *
        from customers
        where creation_date >= to_date('2000/01/01', 'YYYY/MM/DD')
    );
For approach 1, the following is the execution plan:
    Insert on clients  (cost=0.42..113.55 rows=10000 width=4506)
      ->  Subquery Scan on "*SELECT*"  (cost=0.42..113.55 rows=10000 width=4506)
            ->  Limit  (cost=0.42..112.45 rows=10000 width=616)
                  ->  Subquery Scan on sub2  (cost=0.42..57773.19 rows=1000000 width=616)
                        Filter: (sub2.rn >= 0)
                        ->  WindowAgg  (cost=0.42..57579.79 rows=1000000 width=624)
                              ->  Index Scan using "customer_id_pkey" on customers  (cost=0.42..57347.71 rows=1000000 width=616)
                                    Filter: (creation_date >= to_date('2000/01/01'::text, 'YYYY/MM/DD'::text))
For every value of n, the Subquery Scan on sub2 runs over the entire data set, so executing the chunks in parallel defeats the purpose.
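For contrast, if each chunk were defined directly on the primary key, every worker's index scan would touch only its own slice and the window function would not be needed at all. A sketch of one such chunk (assuming customer_id values can be partitioned into ranges, and using insert ... select rather than the values form):

    -- hypothetical chunk for ids 1 .. 10000; each parallel worker
    -- would get a different customer_id range
    insert into clients
    select *
    from customers
    where creation_date >= to_date('2000/01/01', 'YYYY/MM/DD')
      and customer_id >= 1
      and customer_id < 10001;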
Is my understanding correct, or will approach 1 take less time than approach 2?
Also, please suggest if there is a way to improve the query performance.
[The values of the plan nodes have been modified, so please ignore the rows, cost and width values.]
For an insert based on a select, remove the values clause - that only allows for a single row.
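In other words, approach 2 would be written as (assuming clients has the same column layout as customers):

    insert into clients
    select *
    from customers
    where creation_date >= to_date('2000/01/01', 'YYYY/MM/DD');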