How to optimize PostgreSQL COUNT GROUP BY query?

Question

I have a table parameters_products with approx 300k records. Is it possible to optimize this query?

SELECT parameter_id AS id,
       COUNT(product_id) AS COUNT
FROM "parameters_products"
WHERE product_id IN
    (SELECT product_id
     FROM parameters_products
     WHERE parameter_id IN ('2'))
GROUP BY parameter_id

Query output:

2;274669

EXPLAIN ANALYZE VERBOSE... output:

HashAggregate  (cost=23628.54..23628.56 rows=2 width=8) (actual time=2231.367..2231.368 rows=1 loops=1)
  Output: parameters_products.parameter_id, count(parameters_products.product_id)
  Group Key: parameters_products.parameter_id
  ->  Hash Semi Join  (cost=9607.86..22256.43 rows=274421 width=8) (actual time=692.586..1893.261 rows=274669 loops=1)
        Output: parameters_products.parameter_id, parameters_products.product_id
        Hash Cond: (parameters_products.product_id = parameters_products_1.product_id)
        ->  Seq Scan on public.parameters_products  (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.025..353.358 rows=299728 loops=1)
              Output: parameters_products.parameter_id, parameters_products.product_id
        ->  Hash  (cost=5105.60..5105.60 rows=274421 width=4) (actual time=692.331..692.331 rows=274669 loops=1)
              Output: parameters_products_1.product_id
              Buckets: 16384  Batches: 4  Memory Usage: 2425kB
              ->  Seq Scan on public.parameters_products parameters_products_1  (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.013..344.656 rows=274669 loops=1)
                    Output: parameters_products_1.product_id
                    Filter: (parameters_products_1.parameter_id = 2)
                    Rows Removed by Filter: 25059
Planning time: 0.279 ms
Execution time: 2231.499 ms

PostgreSQL 9.4.1 and VACUUM is enabled.

Just tried this quesry, but it is slow too:

SELECT pp1.parameter_id,
       count(pp1.product_id)
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
WHERE pp2.parameter_id IN (2)
GROUP BY pp1.parameter_id

--

HashAggregate  (cost=23742.42..23742.44 rows=2 width=8) (actual time=2361.654..2361.654 rows=1 loops=1)
  Output: pp1.parameter_id, count(pp1.product_id)
  Group Key: pp1.parameter_id
  ->  Hash Join  (cost=9607.86..22370.31 rows=274421 width=8) (actual time=715.409..2012.345 rows=274669 loops=1)
        Output: pp1.parameter_id, pp1.product_id
        Hash Cond: (pp1.product_id = pp2.product_id)
        ->  Seq Scan on public.parameters_products pp1  (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.012..360.789 rows=299728 loops=1)
              Output: pp1.parameter_id, pp1.product_id
        ->  Hash  (cost=5105.60..5105.60 rows=274421 width=4) (actual time=715.176..715.176 rows=274669 loops=1)
              Output: pp2.product_id
              Buckets: 16384  Batches: 4  Memory Usage: 2425kB
              ->  Seq Scan on public.parameters_products pp2  (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.009..353.386 rows=274669 loops=1)
                    Output: pp2.product_id
                    Filter: (pp2.parameter_id = 2)
                    Rows Removed by Filter: 25059
Planning time: 0.135 ms
Execution time: 2361.735 ms

Indexes:

CREATE INDEX parameters_products_parameter_id_idx
  ON parameters_products
  USING btree
  (parameter_id);

CREATE INDEX parameters_products_product_id_idx
  ON parameters_products
  USING btree
  (product_id);

CREATE INDEX parameters_products_product_id_parameter_id_idx
  ON parameters_products
  USING btree
  (product_id, parameter_id);

EXPLAIN ANALYZE VERBOSE
SELECT pp1.parameter_id
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id

-

Hash Left Join  (cost=9241.88..22699.06 rows=299728 width=4) (actual time=727.683..2080.798 rows=299728 loops=1)
  Output: pp1.parameter_id
  Hash Cond: (pp1.product_id = pp2.product_id)
  ->  Seq Scan on public.parameters_products pp1  (cost=0.00..4324.28 rows=299728 width=8) (actual time=0.031..355.656 rows=299728 loops=1)
        Output: pp1.parameter_id, pp1.product_id
  ->  Hash  (cost=4324.28..4324.28 rows=299728 width=4) (actual time=727.579..727.579 rows=299728 loops=1)
        Output: pp2.product_id
        Buckets: 16384  Batches: 4  Memory Usage: 2644kB
        ->  Seq Scan on public.parameters_products pp2  (cost=0.00..4324.28 rows=299728 width=4) (actual time=0.008..350.797 rows=299728 loops=1)
              Output: pp2.product_id
Planning time: 0.472 ms
Execution time: 2392.582 ms

SET enable_seqscan = OFF;

Decreased the execution time, but not significantly.

Isn't 2361 ms is 2.36 seconds? So for processing 300k records, isn't it already good? — Utsav
– Utsav, Commented Oct 18, 2015 at 16:48
Note Rows removed by filter is < 10 % of total rows. An index would make little sense here. — wildplasser
– wildplasser, Commented Oct 18, 2015 at 17:18
It seems that no matter what you do, you need to count about 90% of the records. This is going to take some effort. If optimizing this type of query is really important, you might need to implement triggers to pre-summarize the data. — Gordon Linoff
– Gordon Linoff, Commented Oct 18, 2015 at 17:49

Gordon Linoff · Accepted Answer · 2015-10-18 16:18:32Z

3

The first thing I would try is replacing the IN with an EXISTS:

SELECT parameter_id AS id,
       COUNT(product_id) AS COUNT
FROM parameters_products pp
WHERE EXISTS (SELECT 1
              FROM parameters_products pp2
              WHERE pp2.product_id = pp.product_id AND
                    pp2.parameter_id = 2
             ) 
GROUP BY parameter_id;

And, be sure you have an index on parameters_products(product_id, parameter_id).

Another idea is to use window functions:

select parameter_id, count(*)
from (select pp.*,
             sum(case when pp.parameter_id = 2 then 1 else 0 end) over (partition by product_id) as cnt2
      from parameters_products pp
     ) pp
where cnt2 > 0
group by parameter_id;

answered Oct 18, 2015 at 16:18

Gordon Linoff

1.3m62 gold badges706 silver badges857 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

nanolab Over a year ago

Index already exists: CREATE INDEX parameters_products_product_id_parameter_id_idx ON parameters_products USING btree (product_id, parameter_id); For first query "Execution time: 2239.944 ms" and for second "Execution time: 2526.269 ms"

wildplasser Over a year ago

You could try an index with the opposite order ... INDEX on parameter_products(parameter_id, product_id) .

nanolab Over a year ago

@wildplasser Still the same time

nanolab Over a year ago

Used your first query with SET enable_seqscan = OFF; and this decreased time! Thanks.

AlVaz · Accepted Answer · 2015-10-18 19:43:23Z

2

Try:

SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT
FROM parameters_products pp1
JOIN
  parameters_products pp2
ON
  pp2.parameter_id = 2
AND
  pp1.product_id = pp2.product_id
GROUP BY
  pp1.parameter_id

Moving the filter criteria from your WHERE clause to the ON clause reduces the total number of rows involved in the JOIN. Hopefully this demonstrates the same savings you saw in your comment that brought the execution time below 1s.

edited Oct 18, 2015 at 19:43

answered Oct 18, 2015 at 19:35

AlVaz

7661 gold badge6 silver badges21 bronze badges

5 Comments

nanolab Over a year ago

Sorry, but this is wrong query. There is no sense in JOIN. The result will be the same with: SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT FROM parameters_products pp1 GROUP BY pp1.parameter_id

AlVaz Over a year ago

@nanolab I have updated my answer and used an inner join instead of a left join to accurately reproduce the results of your WHERE clause. My apologies for overlooking that.

nanolab Over a year ago

Yes, it's correct now. But "Execution time: 2249.975 ms". Seems it just impossible to optimize.

AlVaz Over a year ago

@nanolab: One more test. Try

SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT FROM (SELECT product_id FROM parameters_product WHERE parameter_id = 2) AS pp2  JOIN  parameters_products pp1 ON  pp1.product_id = pp2.product_id GROUP BY pp1.parameter_id

This will make the smaller table the driving table of the JOIN and might get us back to the sub-second performance we saw before.

nanolab Over a year ago

Execution time: 2240.112 ms

AlVaz · Accepted Answer · 2015-10-18 20:11:01Z

0

RhodiumToad in #postgresql on freenode recommended a window function like the following. Note this is different than Gordon Linoff's window function by using bool_or instead of sum(case...):

SELECT parameter_id, count(product_id)
FROM
  (SELECT *, bool_or(parameter_id = 2)
   OVER
   (partition by product_id) AS matching
   FROM parameters_products) s
WHERE matching
GROUP BY parameter_id;

RhodiumToad also mentioned that the work_mem parameter may be too small for any query of this scale, whether using a window function, join, or subselect. He recommends increasing the work_mem parameter to avoid sorting routines from spilling to disk.

If either of these help you out, all credit to RhodiumToad.

answered Oct 18, 2015 at 20:11

AlVaz

7661 gold badge6 silver badges21 bronze badges

4 Comments

nanolab Over a year ago

Used this SET LOCAL work_mem = '500MB'; But query is even slower than others: "Execution time: 3079.735 ms"

AlVaz Over a year ago

@nanolab Can you post an EXPLAIN ANALYZE?

AlVaz Over a year ago

@nanolab What is the data type of parameter_id?

nanolab Over a year ago

product_id integer NOT NULL, parameter_id integer NOT NULL,

Collectives™ on Stack Overflow

How to optimize PostgreSQL COUNT GROUP BY query?

3 Answers 3

4 Comments

5 Comments

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

4 Comments

5 Comments

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related