
I have two tables, calls and calls_statistics. calls has a primary key calls_id, which is a foreign key in calls_statistics.

calls currently contains 16k entries.

When I run

SELECT c.*,
       array_agg(cs.mean) AS statistics_means
FROM calls AS c
LEFT JOIN calls_statistics AS cs ON c.calls_id = cs.calls_id
GROUP BY c.calls_id
ORDER BY caller_id ASC, call_time ASC
LIMIT 100;

The query takes about 622 ms

 Limit  (cost=11947.99..11948.24 rows=100 width=551) (actual time=518.921..518.941 rows=100 loops=1)
   ->  Sort  (cost=11947.99..11989.07 rows=16429 width=551) (actual time=518.918..518.928 rows=100 loops=1)
         Sort Key: c.caller_id, c.call_time
         Sort Method: top-N heapsort  Memory: 126kB
         ->  HashAggregate  (cost=11114.73..11320.09 rows=16429 width=551) (actual time=461.869..494.761 rows=16429 loops=1)
               ->  Hash Right Join  (cost=6234.65..10705.12 rows=81922 width=551) (actual time=79.171..257.498 rows=81922 loops=1)
                     Hash Cond: (cs.calls_id = c.calls_id)
                     ->  Seq Scan on calls_statistics cs  (cost=0.00..2627.22 rows=81922 width=12) (actual time=3.534..26.778 rows=81922 loops=1)
                     ->  Hash  (cost=6029.29..6029.29 rows=16429 width=547) (actual time=75.578..75.578 rows=16429 loops=1)
                           Buckets: 2048  Batches: 1  Memory Usage: 9370kB
                           ->  Seq Scan on calls c  (cost=0.00..6029.29 rows=16429 width=547) (actual time=13.806..42.446 rows=16429 loops=1)
 Total runtime: 622.537 ms

However, when I drop the array_agg (and with it the GROUP BY) and run the query, it uses my indexes:

SELECT c.*,
       cs.mean
FROM calls AS c
LEFT JOIN calls_statistics AS cs ON c.calls_id = cs.calls_id
ORDER BY caller_id ASC, call_time ASC
LIMIT 100;

The query takes just 0.565 ms!

 Limit  (cost=0.70..52.93 rows=100 width=551) (actual time=0.077..0.320 rows=100 loops=1)
   ->  Nested Loop Left Join  (cost=0.70..42784.95 rows=81922 width=551) (actual time=0.075..0.304 rows=100 loops=1)
         ->  Index Scan using calls_caller_id_call_time_calls_id_idx on calls c  (cost=0.29..22395.06 rows=16429 width=547) (actual time=0.042..0.091 rows=25 loops=1)
         ->  Index Scan using calls_stats_calls_idx on calls_statistics cs  (cost=0.42..1.18 rows=6 width=12) (actual time=0.003..0.005 rows=4 loops=25)
               Index Cond: (c.calls_id = calls_id)
 Total runtime: 0.565 ms

Can it really be that just aggregating into arrays takes so much time? What am I doing wrong?

I am using Postgres 9.3.

  • I also noted that the sort expects 16429 rows but only gets 100. Somehow the query planner fooled itself. Commented Apr 27, 2016 at 12:15
  • You removed the GROUP BY in the second query. That is probably dominating the time. Commented Apr 27, 2016 at 12:15
  • @GordonLinoff Yeah, I know, but that's the whole point of my question. Commented Apr 27, 2016 at 12:17
  • The first query processes and aggregates 81922 rows and then discards 81822 of those. The second query only processes 100 rows. Of course processing (and grouping) 81922 rows takes longer than just retrieving 100 rows. Commented Apr 27, 2016 at 12:20
  • @user3207838 Selecting 100 rows is a totally different operation from grouping and aggregating joined tables and limiting the output. You are comparing two different queries and are disappointed by the execution time. It's just like running `select * from table_a` and `select 1` and being surprised that the second is quicker. Commented Apr 27, 2016 at 12:23

2 Answers


One option is to select the first 100 rows from table calls and only then join and aggregate calls_statistics.

Something like:

WITH top_calls AS (
   SELECT calls_id
   FROM calls
   ORDER BY caller_id ASC, call_time ASC
   LIMIT 100
)
SELECT c.*,
       array_agg(cs.mean) AS statistics_means
FROM calls AS c
JOIN top_calls USING (calls_id)  -- join back to the base table: GROUP BY c.calls_id can only cover c.* when c is the table whose PK it is
LEFT JOIN calls_statistics AS cs ON c.calls_id = cs.calls_id
GROUP BY c.calls_id
ORDER BY caller_id ASC, call_time ASC;

It will give you exactly the same output as your first query.
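Since you are on 9.3, you could also sidestep the outer GROUP BY entirely with a LATERAL subquery. A sketch, assuming the same table and column names; note one subtle difference: for a call with no statistics, array_agg over zero rows yields NULL here, whereas the LEFT JOIN version yields {NULL}.

```sql
SELECT c.*, s.statistics_means
FROM  (SELECT *
       FROM calls
       ORDER BY caller_id ASC, call_time ASC
       LIMIT 100) AS c
LEFT JOIN LATERAL (
       -- aggregate per call, only for the 100 rows already selected
       SELECT array_agg(cs.mean) AS statistics_means
       FROM calls_statistics cs
       WHERE cs.calls_id = c.calls_id
       ) AS s ON true
ORDER BY c.caller_id ASC, c.call_time ASC;
```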


Optimizing queries without all the info and a live system may be a bit hard, but I'm going to take a shot at this. You can move the LIMIT into a subquery and it should run much faster.

SELECT c.*,
       array_agg(cs.mean) AS statistics_means
FROM
  (SELECT calls_id
   FROM calls
   ORDER BY caller_id ASC, call_time ASC
   LIMIT 100) AS t
JOIN calls AS c USING (calls_id)  -- re-join the base table: GROUP BY c.calls_id can only cover c.* when c is the table whose PK it is
LEFT JOIN calls_statistics AS cs ON c.calls_id = cs.calls_id
GROUP BY c.calls_id
ORDER BY c.caller_id ASC, c.call_time ASC;  -- without this, rows out of GROUP BY come back in no guaranteed order
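One caveat with this pattern, assuming the same schema: a call with no rows in calls_statistics comes out of the LEFT JOIN with a {NULL} array rather than an empty one. If that matters, array_remove (available since 9.3) can strip the NULL, e.g.:

```sql
SELECT c.calls_id,
       -- turns the {NULL} produced by unmatched LEFT JOIN rows into {}
       array_remove(array_agg(cs.mean), NULL) AS statistics_means
FROM calls AS c
LEFT JOIN calls_statistics AS cs ON c.calls_id = cs.calls_id
GROUP BY c.calls_id;
```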
