I am still a novice to SQL, using PostgreSQL 16.2.
create table users(
user_id serial primary key
);
create table task_pool(
task_id serial primary key,
order_id integer,
clicker_id integer
);
create index idx_order_id on task_pool (order_id);
create table task_dispersal(
id serial primary key,
order_id integer,
to_disp integer
);
There are 100,000 rows in users (so 100,000 user_ids);
100 rows in task_dispersal, each with to_disp as 10.
Starting with 0 rows in task_pool.
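For reference, test data matching those counts can be seeded like this (the generate_series seeding is my own sketch, not the actual data):

```sql
-- Hypothetical seed data matching the counts described above:
INSERT INTO users (user_id)
SELECT g FROM generate_series(1, 100000) AS g;

INSERT INTO task_dispersal (order_id, to_disp)
SELECT g, 10 FROM generate_series(1, 100) AS g;

-- keep the serial sequences in step with the explicitly inserted ids
SELECT setval('users_user_id_seq', 100000);
SELECT setval('task_dispersal_id_seq', 100);
```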
I want to iterate through rows in task_dispersal. For each row...
Get to_disp random user_ids from the users table that are not already in task_pool with the iterated row's related order_id. For example, if a row in
task_dispersal had order_id 1 and to_disp 10, it would check task_pool for all rows that have order_id 1, get the user_ids from those rows, filter them out of the users table, and then choose 10 random rows (user_ids) to insert into task_pool.
I am using this query:
INSERT INTO task_pool (clicker_id, order_id)
SELECT U.user_id, TD.order_id
FROM task_dispersal TD
CROSS JOIN LATERAL (
    SELECT Us.user_id
    FROM users Us
    WHERE Us.user_id NOT IN (
        SELECT clicker_id
        FROM task_pool
        WHERE order_id = TD.order_id
    )
    ORDER BY RANDOM()
    LIMIT TD.to_disp
) U;
It works and is relatively fast, but I am still hoping there is a way to optimize it: it takes 26-27 seconds on my (very weak) hosted database, and I am aiming for around a few seconds.
Here is the query plan:
"Insert on public.task_pool (cost=6586.23..674875.51 rows=0 width=0) (actual time=26414.647..26414.649 rows=0 loops=1)"
" Buffers: shared hit=33860 dirtied=22 written=22"
" -> Nested Loop (cost=6586.23..674875.51 rows=500000 width=24) (actual time=1815.128..26402.217 rows=1000 loops=1)"
" Output: nextval('task_pool_task_id_seq'::regclass), td.order_id, us.user_id, CURRENT_TIMESTAMP, 0"
" Buffers: shared hit=28619"
" -> Seq Scan on public.task_dispersal td (cost=0.00..3.00 rows=100 width=8) (actual time=0.008..0.120 rows=100 loops=1)"
" Output: td.id, td.order_id, td.reset_time, td.disp_remaining, td.daily_disp, td.disp_interval, td.next_disp_time, td.expired_tasks, td.to_disp"
" Buffers: shared hit=2"
" -> Limit (cost=6586.23..6598.73 rows=5000 width=12) (actual time=251.784..251.786 rows=10 loops=100)"
" Output: us.user_id, (random())"
" Buffers: shared hit=27607"
" -> Sort (cost=6586.23..6711.23 rows=50000 width=12) (actual time=251.781..251.782 rows=10 loops=100)"
" Output: us.user_id, (random())"
" Sort Key: (random())"
" Sort Method: top-N heapsort Memory: 25kB"
" Buffers: shared hit=27607"
" -> Index Only Scan using users_pkey on public.users us (cost=2.96..2181.57 rows=50000 width=12) (actual time=3.048..139.824 rows=100000 loops=100)"
" Output: us.user_id, random()"
" Filter: (NOT (hashed SubPlan 1))"
" Heap Fetches: 0"
" Buffers: shared hit=27607"
" SubPlan 1"
" -> Index Scan using idx_order_id on public.task_pool task_pool_1 (cost=0.15..2.63 rows=16 width=4) (actual time=0.022..0.022 rows=0 loops=100)"
" Output: task_pool_1.clicker_id"
" Index Cond: (task_pool_1.order_id = td.order_id)"
" Buffers: shared hit=106"
"Settings: effective_io_concurrency = '200', random_page_cost = '1.1', effective_cache_size = '192MB', max_parallel_workers = '1', max_parallel_workers_per_gather = '1', work_mem = '1703kB'"
"Query Identifier: 3069500943293269296"
"Planning:"
" Buffers: shared hit=40 read=3"
"Planning Time: 0.317 ms"
"JIT:"
" Functions: 22"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 1.757 ms, Inlining 208.912 ms, Optimization 913.724 ms, Emission 399.368 ms, Total 1523.761 ms"
"Execution Time: 26416.272 ms"
I don't really know how to read the query plan, but I am pretty sure the ORDER BY RANDOM() over the user_ids that are not already in task_pool with the order_id is the most costly part?
I don't think I can use tablesample bernoulli to get random users from the users table, though: some or all of the sample rows it chooses might already be in task_pool, so it would insert fewer than the required amount (to_disp) of user_ids (this happens fairly often after running the query a few times).
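To make that concrete, this is the kind of plain tablesample query I mean (order_id 1 and the 0.05 % sampling rate are just made-up values for illustration):

```sql
-- Naive bernoulli sampling: ~0.05% of 100,000 users is ~50 candidate rows,
-- but any of them may already be in task_pool for this order_id,
-- so fewer than 10 rows can survive the filter.
SELECT us.user_id
FROM users us TABLESAMPLE BERNOULLI (0.05)
WHERE us.user_id NOT IN (
    SELECT clicker_id FROM task_pool WHERE order_id = 1
)
LIMIT 10;
```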
If I could use tablesample bernoulli on the RESULT of
WHERE user_id NOT IN (
SELECT clicker_id
FROM task_pool
WHERE order_id = TD.order_id
)
That would be awesome; unfortunately it can only be used on tables, temp tables, or materialized views, and can't be used after WHERE.
I have tried making temp tables of the unused user_ids per row and then using tablesample bernoulli on those, but that was very slow when each row has up to 100,000 unused user_ids.
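Roughly, that temp-table attempt looked like this per task_dispersal row (a sketch, shown for order_id 1; the sampling percentage is arbitrary):

```sql
-- Per-row temp table of unused user_ids: up to 100,000 rows copied
-- for every task_dispersal row, which is what made this approach slow.
CREATE TEMP TABLE unused_users AS
SELECT u.user_id
FROM users u
WHERE u.user_id NOT IN (
    SELECT clicker_id FROM task_pool WHERE order_id = 1
);

SELECT user_id
FROM unused_users TABLESAMPLE BERNOULLI (0.05)
LIMIT 10;

DROP TABLE unused_users;
```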
At this point, I don't think there is any optimization that can be done, other than upgrading the server. Any better ideas?
You insert into task_pool once for every distinct order_id? So task_pool(clicker_id, order_id) is unique? Also defined as unique? How many order_ids are there / can there be? It seems all rows in task_dispersal are processed at once. Is that so? By a single query? And then the job is done? Also, is task_dispersal.order_id unique? Defined as unique?