I am still a novice to SQL, using PostgreSQL 16.2.
create table users(
user_id serial primary key
);
create table task_pool(
task_id serial primary key,
order_id integer,
clicker_id integer
);
create index idx_order_id on task_pool (order_id);
create table task_dispersal(
id serial primary key,
order_id integer,
to_disp integer
);
There are 100,000 rows in users (so 100,000 user_ids);
100 rows in task_dispersal, each with to_disp as 10.
Starting with 0 rows in task_pool.
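For reference, test data matching those counts can be seeded like this (the generate_series seeding is my own sketch, not the actual data):

```sql
-- Hypothetical seed data matching the counts described above:
INSERT INTO users (user_id)
SELECT g FROM generate_series(1, 100000) AS g;

INSERT INTO task_dispersal (order_id, to_disp)
SELECT g, 10 FROM generate_series(1, 100) AS g;

-- keep the serial sequences in step with the explicitly inserted ids
SELECT setval('users_user_id_seq', 100000);
SELECT setval('task_dispersal_id_seq', 100);
```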
I want to iterate through rows in task_dispersal. For each row...
Get to_disp random user_ids from the users table that are not already in task_pool with the iterated row's related order_id. For example, if a row in
task_dispersal had order_id 1 and to_disp 10, it would check task_pool for all rows that have order_id 1, get the user_ids from those rows, filter them out of the users table, and then choose 10 random rows (user_ids) to insert into task_pool.
I am using this query:
INSERT INTO task_pool (clicker_id, order_id)
SELECT U.user_id, TD.order_id
FROM task_dispersal TD
CROSS JOIN LATERAL (
    SELECT Us.user_id
    FROM users Us
    WHERE Us.user_id NOT IN (
        SELECT clicker_id
        FROM task_pool
        WHERE order_id = TD.order_id
    )
    ORDER BY RANDOM()
    LIMIT TD.to_disp
) U;
It works and is relatively fast, but I am still hoping there is a way to optimize it: it takes 26-27 seconds on my (very weak) hosted database, and I am aiming for around a few seconds.
Here is the query plan:
"Insert on public.task_pool (cost=6586.23..674875.51 rows=0 width=0) (actual time=26414.647..26414.649 rows=0 loops=1)"
" Buffers: shared hit=33860 dirtied=22 written=22"
" -> Nested Loop (cost=6586.23..674875.51 rows=500000 width=24) (actual time=1815.128..26402.217 rows=1000 loops=1)"
" Output: nextval('task_pool_task_id_seq'::regclass), td.order_id, us.user_id, CURRENT_TIMESTAMP, 0"
" Buffers: shared hit=28619"
" -> Seq Scan on public.task_dispersal td (cost=0.00..3.00 rows=100 width=8) (actual time=0.008..0.120 rows=100 loops=1)"
" Output: td.id, td.order_id, td.reset_time, td.disp_remaining, td.daily_disp, td.disp_interval, td.next_disp_time, td.expired_tasks, td.to_disp"
" Buffers: shared hit=2"
" -> Limit (cost=6586.23..6598.73 rows=5000 width=12) (actual time=251.784..251.786 rows=10 loops=100)"
" Output: us.user_id, (random())"
" Buffers: shared hit=27607"
" -> Sort (cost=6586.23..6711.23 rows=50000 width=12) (actual time=251.781..251.782 rows=10 loops=100)"
" Output: us.user_id, (random())"
" Sort Key: (random())"
" Sort Method: top-N heapsort Memory: 25kB"
" Buffers: shared hit=27607"
" -> Index Only Scan using users_pkey on public.users us (cost=2.96..2181.57 rows=50000 width=12) (actual time=3.048..139.824 rows=100000 loops=100)"
" Output: us.user_id, random()"
" Filter: (NOT (hashed SubPlan 1))"
" Heap Fetches: 0"
" Buffers: shared hit=27607"
" SubPlan 1"
" -> Index Scan using idx_order_id on public.task_pool task_pool_1 (cost=0.15..2.63 rows=16 width=4) (actual time=0.022..0.022 rows=0 loops=100)"
" Output: task_pool_1.clicker_id"
" Index Cond: (task_pool_1.order_id = td.order_id)"
" Buffers: shared hit=106"
"Settings: effective_io_concurrency = '200', random_page_cost = '1.1', effective_cache_size = '192MB', max_parallel_workers = '1', max_parallel_workers_per_gather = '1', work_mem = '1703kB'"
"Query Identifier: 3069500943293269296"
"Planning:"
" Buffers: shared hit=40 read=3"
"Planning Time: 0.317 ms"
"JIT:"
" Functions: 22"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 1.757 ms, Inlining 208.912 ms, Optimization 913.724 ms, Emission 399.368 ms, Total 1523.761 ms"
"Execution Time: 26416.272 ms"
I don't really know how to read the query plan, but I am pretty sure the ORDER BY RANDOM() over the user_ids that are not already in task_pool with the order_id is the most costly part?
I don't think I can use tablesample bernoulli to get random users from the users table, though: some or all of the sample rows it chooses might already be in task_pool, so it would insert fewer than the required amount (to_disp) of user_ids (this happens fairly often after running the query a few times).
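To make that concrete, this is the kind of plain tablesample query I mean (order_id 1 and the 0.05 % sampling rate are just made-up values for illustration):

```sql
-- Naive bernoulli sampling: ~0.05% of 100,000 users is ~50 candidate rows,
-- but any of them may already be in task_pool for this order_id,
-- so fewer than 10 rows can survive the filter.
SELECT us.user_id
FROM users us TABLESAMPLE BERNOULLI (0.05)
WHERE us.user_id NOT IN (
    SELECT clicker_id FROM task_pool WHERE order_id = 1
)
LIMIT 10;
```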
If I could use tablesample bernoulli on the RESULT of
WHERE user_id NOT IN (
SELECT clicker_id
FROM task_pool
WHERE order_id = TD.order_id
)
That would be awesome; unfortunately it can only be used on tables, temp tables, or materialized views, and can't be used after WHERE.
I have tried making temp tables of the unused user_ids per row and then using tablesample bernoulli on those, but that was very slow when each row has up to 100,000 unused user_ids.
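Roughly, that temp-table attempt looked like this per task_dispersal row (a sketch, shown for order_id 1; the sampling percentage is arbitrary):

```sql
-- Per-row temp table of unused user_ids: up to 100,000 rows copied
-- for every task_dispersal row, which is what made this approach slow.
CREATE TEMP TABLE unused_users AS
SELECT u.user_id
FROM users u
WHERE u.user_id NOT IN (
    SELECT clicker_id FROM task_pool WHERE order_id = 1
);

SELECT user_id
FROM unused_users TABLESAMPLE BERNOULLI (0.05)
LIMIT 10;

DROP TABLE unused_users;
```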
At this point, I don't think there is any optimization that can be done, other than upgrading the server. Any better ideas?
You insert into task_pool once for every distinct order_id? So task_pool(clicker_id, order_id) is unique? Also defined as unique? How many order_ids are there / can there be? It seems all rows in task_dispersal are processed at once. Is that so? By a single query? And then the job is done? Also, is task_dispersal.order_id unique? Defined as unique?