
Let's say there is a table customers with the following columns:

  • customer_id as primary key

  • creation_date

  • a few other columns

I want to query all the entries from the customers table with creation_date >= to_date('2000/01/01', 'YYYY/MM/DD') and insert them into another table, clients. The customers table has around 10 million entries, and both tables are in the same database. I want to minimize the time taken to copy this data.

There are two approaches:

  1. Run the following query in parallel for each n, with n = 1, 10001, 20001, and so on:

    insert into clients 
    values ( 
      select * 
      from ( 
        select *, row_number() over (ORDER BY customer_id) as rn 
        from ( 
          select * 
          from customers 
          where creation_date >= to_date('2000/01/01', 'YYYY/MM/DD')
        ) as sub1
      ) as sub2  
      where rn >= n limit 10000
    );
    
  2. Run a single query:

    insert into clients 
    values ( 
      select * 
      from customers 
      where creation_date >= to_date('2000/01/01', 'YYYY/MM/DD')
    );
    

For 1, the following is the execution plan:

    Insert on clients  (cost=0.42..113.55 rows=10000 width=4506)
       ->  Subquery Scan on "*SELECT*"  (cost=0.42..113.55 rows=10000 width=4506)
             ->  Limit  (cost=0.42..112.45 rows=10000 width=616)
                   ->  Subquery Scan on sub2  (cost=0.42..57773.19 rows=1000000 width=616)
                         Filter: (sub2.rn >= 0)
                         ->  WindowAgg  (cost=0.42..57579.79 rows=1000000 width=624)
                                ->  Index Scan using "customer_id_pkey" on customers (cost=0.42..57347.71 rows=1000000 width=616)
                                      Filter: (creation_date >= to_date('2000/01/01'::text, 'YYYY/MM/DD'::text))

For each value of n run in parallel, the WindowAgg under the Subquery Scan on sub2 still has to produce every row up to rn = n + 9999 before the Limit can stop, so the later chunks re-read almost all of the data. Hence doing it in parallel has defeated the purpose.

Is my understanding correct, or will approach 1 take less time than approach 2?

Also, please suggest if there is a way to improve the query performance.

[The plan node values have been modified, so please ignore the rows, cost, and width values.]

2 Comments
  • Database size is relative. 10 million rows isn't necessarily big if your server is reasonably powerful, the load is low, or the records are small. If any of those are true, you may really be overthinking this. If none of them are true, you might want more resources for your DB server. Commented Mar 18, 2018 at 1:49
  • The inserts won't work if the query returns more than one row. If you want an insert based on a select, remove the values clause; that only allows a single row. Commented Mar 18, 2018 at 6:19

1 Answer


This is more of an extended comment.

You should really try the different approaches to see which works best. Personally, I would pretty much go with the second approach, although I would write it as:

insert into clients ( . . .)
    select c.*
    from customers c
    where creation_date >= date '2000-01-01';

Why? Doing multiple inserts into the table generates table contention, log contention, and possibly index contention. That is likely to slow things down. I'm not saying it will -- just that instead of thinking about it, I'd start the query and wait however long it takes (a cup of coffee? lunch? overnight?).
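
If you want to measure rather than guess, psql's built-in timing is enough for a rough comparison (the column list is elided here, as above):

    -- in psql: report the wall-clock time of each statement
    \timing on

    insert into clients ( . . . )
    select c.*
    from customers c
    where creation_date >= date '2000-01-01';

Run each candidate on a quiet system and compare the reported durations.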

If your data is really big, then you should probably consider partitioning it. If this is the case, then multiple simultaneous inserts are a good idea -- as long as they all go into a single partition. A rough sketch of what that could look like follows.
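
For reference, declarative range partitioning (available since PostgreSQL 10) looks roughly like this; the column definitions and partition bounds here are hypothetical:

    -- hypothetical sketch: range-partition clients by creation_date
    create table clients (
        customer_id   bigint not null,
        creation_date date   not null
        -- ... the remaining columns ...
    ) partition by range (creation_date);

    -- one partition per decade; parallel inserts that each target
    -- a different partition would then not contend with each other
    create table clients_2000s partition of clients
        for values from ('2000-01-01') to ('2010-01-01');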

In your first version, by the way, you don't need to use row_number(). You can use the offset/fetch syntax to get different pieces of the data.
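
A chunked insert using offset/fetch, as suggested, might look something like this sketch (one statement per worker, stepping the offset by 10000):

    -- sketch of one 10,000-row chunk using offset/fetch instead of row_number()
    insert into clients ( . . . )
    select c.*
    from customers c
    where creation_date >= date '2000-01-01'
    order by customer_id
    offset 0 rows fetch next 10000 rows only;  -- next chunk: offset 10000 rows, ...

Bear in mind that the server still has to read and discard every skipped row to honor a large offset, so this shares the scaling problem of the row_number() version; it only removes the window function.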


3 Comments

How is the query in approach 2 different from what you are mentioning?
@Geeta: It's not different; it's just another way to specify a date. Of course, the preferred version should be a standard SQL date literal: DATE '2000-01-01' :-)
Thanks, I have one more question here. One way to partition the data is to use offset and limit and run the inserts in parallel, but to do that I will have to get the total count of entries, which means scanning the entire table. Another way is to use an indexed column, say creation_date, and split the data across threads for parallel inserts based on it. Will this solution perform better?
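
A sketch of the idea in the last comment, assuming creation_date is indexed as the comment supposes: bound each worker's chunk by the indexed column itself, so no total count is needed and each scan stops at its own boundary (the yearly ranges here are made up):

    -- sketch: one worker's chunk, bounded by the indexed column
    insert into clients ( . . . )
    select c.*
    from customers c
    where creation_date >= date '2000-01-01'
      and creation_date <  date '2001-01-01';  -- next worker: 2001, 2002, ...

Each worker then does an index range scan over just its own slice instead of counting or skipping rows.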
