I have a single small table (~400k rows). The table is indexed by collection_id and contains a JSON column with several indexes defined on its values; one of them is on the value tagline.id.
The query to get all the objects with a specific tagline.id is sometimes very slow:
explain (analyze, buffers)
SELECT "objects_object"."created",
"objects_object"."modified",
"objects_object"."_id",
"objects_object"."id",
"objects_object"."collection_id",
"objects_object"."data",
"objects_object"."search",
"objects_object"."location"::bytea
FROM "objects_object"
WHERE ("objects_object"."collection_id" IN (3381, 3321, 3312, 3262, 3068, 2684, 2508, 2159, 2158, 2154, 2157, 2156)
AND (("objects_object"."data" #>> ARRAY['tagline','id']))::float IN ('8')
AND ("objects_object"."data" -> 'tagline') ? 'id')
ORDER BY "objects_object"."created" DESC,
"objects_object"."id" ASC
LIMIT 101;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8.46..8.47 rows=1 width=1239) (actual time=5513.374..5513.399 rows=101 loops=1)
Buffers: shared hit=4480 read=6261
-> Sort (cost=8.46..8.47 rows=1 width=1239) (actual time=5513.372..5513.389 rows=101 loops=1)
Sort Key: created DESC, id
Sort Method: top-N heapsort Memory: 247kB
Buffers: shared hit=4480 read=6261
-> Index Scan using index_tagline_id_float_51a27976 on objects_object (cost=0.42..8.45 rows=1 width=1239) (actual time=943.689..5513.002 rows=235 loops=1)
Index Cond: (((data #>> '{tagline,id}'::text[]))::double precision = '8'::double precision)
Filter: (collection_id = ANY ('{3381,3321,3312,3262,3068,2684,2508,2159,2158,2154,2157,2156}'::integer[]))
Rows Removed by Filter: 47295
Buffers: shared hit=4480 read=6261
Planning time: 0.244 ms
Execution time: 5513.439 ms
(13 rows)
If executed multiple times, the execution time drops to ~5 ms.
What is taking so long? Why does the execution time drop that much after the first run?
I don't think it's memory related, since the default work_mem (4 MB) is much higher than the 247 kB the sort required.
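The buffer counts in the first plan (shared hit=4480 read=6261) suggest the slow first run is mostly cold-cache disk I/O rather than memory pressure. A minimal sketch to check how much of the table is currently sitting in shared_buffers, assuming the contrib extensions pg_buffercache and pg_prewarm are available (they are not installed by default):
-- one-time setup (needs appropriate privileges)
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- how many 8 kB pages of objects_object are cached in shared_buffers right now
SELECT count(*) AS cached_pages
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('objects_object');
-- optionally warm the table into the cache before timing the query again
SELECT pg_prewarm('objects_object');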
EDIT: Index definitions:
SELECT indexdef FROM pg_indexes
WHERE indexname = 'index_tagline_id_float_51a27976';
indexdef
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CREATE INDEX index_tagline_id_float_51a27976 ON public.objects_object USING btree ((((data #>> ARRAY['tagline'::text, 'id'::text]))::double precision)) WHERE ((data -> 'tagline'::text) ? 'id'::text)
(1 row)
SELECT indexdef FROM pg_indexes
WHERE indexname = 'objects_object_collection_id_6f1559f5';
indexdef
---------------------------------------------------------------------------------------------------------
CREATE INDEX objects_object_collection_id_6f1559f5 ON public.objects_object USING btree (collection_id)
(1 row)
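As a side check, the statistics views show whether each of these indexes is actually being used and how large it is; a quick, read-only query against the standard pg_stat_user_indexes view:
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'objects_object'
ORDER BY idx_scan DESC;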
EDIT:
After adding the index test:
select indexdef from pg_indexes where indexname='test';
indexdef
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CREATE INDEX test ON public.objects_object USING btree ((((data #>> ARRAY['tagline'::text, 'id'::text]))::double precision), collection_id) WHERE ((data -> 'tagline'::text) ? 'id'::text)
(1 row)
The execution time decreased, but the shared buffer hits went up as well, so I'm not sure this actually improved the performance:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8.46..8.47 rows=1 width=1238) (actual time=1721.260..1721.281 rows=101 loops=1)
Buffers: shared hit=5460 read=5115
-> Sort (cost=8.46..8.47 rows=1 width=1238) (actual time=1721.257..1721.270 rows=101 loops=1)
Sort Key: created DESC, id
Sort Method: top-N heapsort Memory: 298kB
Buffers: shared hit=5460 read=5115
-> Index Scan using test on objects_object (cost=0.42..8.45 rows=1 width=1238) (actual time=1682.637..1720.793 rows=235 loops=1)
Index Cond: (((data #>> '{tagline,id}'::text[]))::double precision = '8'::double precision)
Filter: (collection_id = ANY ('{3381,3321,3312,3262,3068,2684,2508,2159,2158,2154,2157,2156}'::integer[]))
Rows Removed by Filter: 47295
Buffers: shared hit=5454 read=5115
Planning time: 238.364 ms
Execution time: 1762.996 ms
(13 rows)
The problem seems to be that collection_id should be part of the index condition rather than the filter; that would avoid fetching a large number of rows from (slow) storage only to discard them.
Why is the index not working as expected?
UPDATE: Apparently the order of the index columns impacts the query plan. I rewrote the index as:
select indexdef from pg_indexes where indexname='test';
indexdef
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CREATE INDEX test ON public.objects_object USING btree (collection_id, (((data #>> ARRAY['tagline'::text, 'id'::text]))::double precision)) WHERE ((data -> 'tagline'::text) ? 'id'::text)
Running the query now, we can see the much lower number of rows read:
Limit (cost=57.15..57.16 rows=1 width=1177) (actual time=1.043..1.059 rows=101 loops=1)
Buffers: shared hit=101 read=10
-> Sort (cost=57.15..57.16 rows=1 width=1177) (actual time=1.040..1.047 rows=101 loops=1)
Sort Key: created DESC, id
Sort Method: top-N heapsort Memory: 304kB
Buffers: shared hit=101 read=10
-> Index Scan using test on objects_object (cost=0.42..57.14 rows=1 width=1177) (actual time=0.094..0.670 rows=232 loops=1)
Index Cond: ((collection_id = ANY ('{3381,3321,3312,3262,3068,2684,2508,2159,2158,2154,2157,2156}'::integer[])) AND (((data #>> '{tagline,id}'::text[]))::double precision = '8'::double precision))
Buffers: shared hit=95 read=10
Planning time: 416.365 ms
Execution time: 43.463 ms
(11 rows)
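If this column order is kept, the original single-column expression index is probably redundant for this query. A possible cleanup, only if no other query relies on the old index; the new index name below is just a placeholder:
DROP INDEX IF EXISTS index_tagline_id_float_51a27976;
-- give the experimental index a more descriptive name (hypothetical name)
ALTER INDEX test RENAME TO objects_object_collection_tagline_idx;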
There is an index on collection_id, but it's not being used on this query as far as I can see. Would ANALYZE on the table improve the performance? last_analyze is NULL (but last_autoanalyze shouldn't be).
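To see when the table's statistics were last refreshed, and to refresh them manually, something like the following should work against the standard statistics view:
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'objects_object';
-- refresh planner statistics for this table only
ANALYZE objects_object;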