2

I have a table parameters_products with approx 300k records. Is it possible to optimize this query?

SELECT parameter_id AS id,
       COUNT(product_id) AS COUNT
FROM "parameters_products"
WHERE product_id IN
    (SELECT product_id
     FROM parameters_products
     WHERE parameter_id IN ('2'))
GROUP BY parameter_id

Query output:

2;274669

EXPLAIN ANALYZE VERBOSE... output:

HashAggregate  (cost=23628.54..23628.56 rows=2 width=8) (actual time=2231.367..2231.368 rows=1 loops=1)
  Output: parameters_products.parameter_id, count(parameters_products.product_id)
  Group Key: parameters_products.parameter_id
  ->  Hash Semi Join  (cost=9607.86..22256.43 rows=274421 width=8) (actual time=692.586..1893.261 rows=274669 loops=1)
        Output: parameters_products.parameter_id, parameters_products.product_id
        Hash Cond: (parameters_products.product_id = parameters_products_1.product_id)
        ->  Seq Scan on public.parameters_products  (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.025..353.358 rows=299728 loops=1)
              Output: parameters_products.parameter_id, parameters_products.product_id
        ->  Hash  (cost=5105.60..5105.60 rows=274421 width=4) (actual time=692.331..692.331 rows=274669 loops=1)
              Output: parameters_products_1.product_id
              Buckets: 16384  Batches: 4  Memory Usage: 2425kB
              ->  Seq Scan on public.parameters_products parameters_products_1  (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.013..344.656 rows=274669 loops=1)
                    Output: parameters_products_1.product_id
                    Filter: (parameters_products_1.parameter_id = 2)
                    Rows Removed by Filter: 25059
Planning time: 0.279 ms
Execution time: 2231.499 ms

PostgreSQL 9.4.1 and VACUUM is enabled.

Just tried this quesry, but it is slow too:

SELECT pp1.parameter_id,
       count(pp1.product_id)
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
WHERE pp2.parameter_id IN (2)
GROUP BY pp1.parameter_id

--

HashAggregate  (cost=23742.42..23742.44 rows=2 width=8) (actual time=2361.654..2361.654 rows=1 loops=1)
  Output: pp1.parameter_id, count(pp1.product_id)
  Group Key: pp1.parameter_id
  ->  Hash Join  (cost=9607.86..22370.31 rows=274421 width=8) (actual time=715.409..2012.345 rows=274669 loops=1)
        Output: pp1.parameter_id, pp1.product_id
        Hash Cond: (pp1.product_id = pp2.product_id)
        ->  Seq Scan on public.parameters_products pp1  (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.012..360.789 rows=299728 loops=1)
              Output: pp1.parameter_id, pp1.product_id
        ->  Hash  (cost=5105.60..5105.60 rows=274421 width=4) (actual time=715.176..715.176 rows=274669 loops=1)
              Output: pp2.product_id
              Buckets: 16384  Batches: 4  Memory Usage: 2425kB
              ->  Seq Scan on public.parameters_products pp2  (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.009..353.386 rows=274669 loops=1)
                    Output: pp2.product_id
                    Filter: (pp2.parameter_id = 2)
                    Rows Removed by Filter: 25059
Planning time: 0.135 ms
Execution time: 2361.735 ms

Indexes:

CREATE INDEX parameters_products_parameter_id_idx
  ON parameters_products
  USING btree
  (parameter_id);

CREATE INDEX parameters_products_product_id_idx
  ON parameters_products
  USING btree
  (product_id);

CREATE INDEX parameters_products_product_id_parameter_id_idx
  ON parameters_products
  USING btree
  (product_id, parameter_id);

EXPLAIN ANALYZE VERBOSE
SELECT pp1.parameter_id
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id

-

Hash Left Join  (cost=9241.88..22699.06 rows=299728 width=4) (actual time=727.683..2080.798 rows=299728 loops=1)
  Output: pp1.parameter_id
  Hash Cond: (pp1.product_id = pp2.product_id)
  ->  Seq Scan on public.parameters_products pp1  (cost=0.00..4324.28 rows=299728 width=8) (actual time=0.031..355.656 rows=299728 loops=1)
        Output: pp1.parameter_id, pp1.product_id
  ->  Hash  (cost=4324.28..4324.28 rows=299728 width=4) (actual time=727.579..727.579 rows=299728 loops=1)
        Output: pp2.product_id
        Buckets: 16384  Batches: 4  Memory Usage: 2644kB
        ->  Seq Scan on public.parameters_products pp2  (cost=0.00..4324.28 rows=299728 width=4) (actual time=0.008..350.797 rows=299728 loops=1)
              Output: pp2.product_id
Planning time: 0.472 ms
Execution time: 2392.582 ms

SET enable_seqscan = OFF;

Decreased the execution time, but not significantly.

20
  • 1
    Replace WHERE IN with JOIN Commented Oct 18, 2015 at 16:14
  • 1
    @lad2025 Execution time: 2361.735 ms Commented Oct 18, 2015 at 16:28
  • 1
    Isn't 2361 ms is 2.36 seconds? So for processing 300k records, isn't it already good? Commented Oct 18, 2015 at 16:48
  • 2
    Note Rows removed by filter is < 10 % of total rows. An index would make little sense here. Commented Oct 18, 2015 at 17:18
  • 1
    It seems that no matter what you do, you need to count about 90% of the records. This is going to take some effort. If optimizing this type of query is really important, you might need to implement triggers to pre-summarize the data. Commented Oct 18, 2015 at 17:49

3 Answers 3

3

The first thing I would try is replacing the IN with an EXISTS:

SELECT parameter_id AS id,
       COUNT(product_id) AS COUNT
FROM parameters_products pp
WHERE EXISTS (SELECT 1
              FROM parameters_products pp2
              WHERE pp2.product_id = pp.product_id AND
                    pp2.parameter_id = 2
             ) 
GROUP BY parameter_id;

And, be sure you have an index on parameters_products(product_id, parameter_id).

Another idea is to use window functions:

select parameter_id, count(*)
from (select pp.*,
             sum(case when pp.parameter_id = 2 then 1 else 0 end) over (partition by product_id) as cnt2
      from parameters_products pp
     ) pp
where cnt2 > 0
group by parameter_id;
Sign up to request clarification or add additional context in comments.

4 Comments

Index already exists: CREATE INDEX parameters_products_product_id_parameter_id_idx ON parameters_products USING btree (product_id, parameter_id); For first query "Execution time: 2239.944 ms" and for second "Execution time: 2526.269 ms"
You could try an index with the opposite order ... INDEX on parameter_products(parameter_id, product_id) .
@wildplasser Still the same time
Used your first query with SET enable_seqscan = OFF; and this decreased time! Thanks.
2

Try:

SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT
FROM parameters_products pp1
JOIN
  parameters_products pp2
ON
  pp2.parameter_id = 2
AND
  pp1.product_id = pp2.product_id
GROUP BY
  pp1.parameter_id

Moving the filter criteria from your WHERE clause to the ON clause reduces the total number of rows involved in the JOIN. Hopefully this demonstrates the same savings you saw in your comment that brought the execution time below 1s.

5 Comments

Sorry, but this is wrong query. There is no sense in JOIN. The result will be the same with: SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT FROM parameters_products pp1 GROUP BY pp1.parameter_id
@nanolab I have updated my answer and used an inner join instead of a left join to accurately reproduce the results of your WHERE clause. My apologies for overlooking that.
Yes, it's correct now. But "Execution time: 2249.975 ms". Seems it just impossible to optimize.
@nanolab: One more test. Try SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT FROM (SELECT product_id FROM parameters_product WHERE parameter_id = 2) AS pp2 JOIN parameters_products pp1 ON pp1.product_id = pp2.product_id GROUP BY pp1.parameter_id This will make the smaller table the driving table of the JOIN and might get us back to the sub-second performance we saw before.
Execution time: 2240.112 ms
0

RhodiumToad in #postgresql on freenode recommended a window function like the following. Note this is different than Gordon Linoff's window function by using bool_or instead of sum(case...):

SELECT parameter_id, count(product_id)
FROM
  (SELECT *, bool_or(parameter_id = 2)
   OVER
   (partition by product_id) AS matching
   FROM parameters_products) s
WHERE matching
GROUP BY parameter_id;

RhodiumToad also mentioned that the work_mem parameter may be too small for any query of this scale, whether using a window function, join, or subselect. He recommends increasing the work_mem parameter to avoid sorting routines from spilling to disk.

If either of these help you out, all credit to RhodiumToad.

4 Comments

Used this SET LOCAL work_mem = '500MB'; But query is even slower than others: "Execution time: 3079.735 ms"
@nanolab Can you post an EXPLAIN ANALYZE?
@nanolab What is the data type of parameter_id?
product_id integer NOT NULL, parameter_id integer NOT NULL,

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.