I'm running Postgres 9.6.6 and I have a relatively simple query, however, I'm finding that it is slow with the hash aggregation function which is the most costly operation:
select
program_schedule_source_count.full_name,
program_schedule_source_count.country,
sum(program_schedule_source_count.displays) as displays,
program_schedule_source_count.source_id,
program_schedule_source_count.source_region
from program_schedule_source_count
where program_schedule_source_count.original_title = 'How I Met Your Mother'
and program_schedule_source_count.show_type in ('SM', 'SE')
and program_schedule_source_count.start_date between '20200101' and '20200217'
group by
program_schedule_source_count.source_id,
program_schedule_source_count.full_name,
program_schedule_source_count.country,
program_schedule_source_count.source_region;
Below is the query plan:
HashAggregate (cost=1139676.26..1139725.25 rows=3919 width=46) (actual time=18769.670..18770.066 rows=736 loops=1)
Group Key: source_id, full_name, country, source_region
-> Index Scan using title_date_show_type_country on program_schedule_source_count (cost=0.70..1139186.45 rows=39185 width=46) (actual time=0.098..18733.005 rows=42654 loops=1)
Index Cond: ((start_date >= '20200101'::bpchar) AND (start_date <= '20200217'::bpchar) AND ((original_title)::text = 'How I Met Your Mother'::text))
Filter: (show_type = ANY ('{SM,SE}'::bpchar[]))
Planning time: 0.223 ms
Execution time: 18770.252 ms
The table has indexes on all of the fields in the where clause:
CREATE UNIQUE INDEX program_schedule_pkey ON public.program_schedule_source_count USING btree (source_id, start_date, program_id);
CREATE INDEX title_date_show_type_country ON public.program_schedule_source_count USING btree (start_date, original_title, release_year, show_type, country);
I have tried changing the order of the group by function but this did nothing to alter the performance. I tried to disable hash aggregation to see if this would speed up the query but it still runs in roughly the same amount of time. I assume adding an index to the group by fields would not have any benefit because I am not searching these fields.
I've seen that clustering could benefit but in the documentation it says you would cluster by an index does this mean I would have to create another index of the group by fields and then cluster the data by this?
Is there an alternative way I can write the query which could make it faster?
Thanks all for your help
explain (analyze, buffers)to see how much data was retrieved during the scan. If you turn ontrack_io_timingsthen you can also see how fast that was