
I'm designing an application for a shop catalog and have run into quite slow PostgreSQL performance.

Here is a simplified database schema (actually there are extra tables for the many-to-many relations): [schema diagram]

I'd like to implement a filter by attributes (color, size, brand, etc.) based on the selected catalog categories (T-shirts, bags, etc.).

Here is an example of the query selecting the available attributes for a selected list of categories.

SELECT DISTINCT T1.attribute_id
FROM item T0 LEFT OUTER JOIN item_attr_color T1 ON ( T0.id = T1.item_id ) 
WHERE T0.catalog_id IN (1, 2, 6, 7, 14, 23, 26, 31, 36, 37, 45, 67, 70, 76, 77, 81, 95, 112, 118, 119, 120, 10, 11, 29, 101, 12, 13, 16, 17, 19, 20, 30, 33, 35, 42, 43, 47, 48, 54, 57, 58, 69, 78, 109, 56, 64, 65, 66, 68, 71, 74, 75, 93, 72, 73, 87, 88, 96, 99, 103, 105, 108, 110);

Currently the database is rather small (~100k records), but this query still takes up to 400 ms, which is quite a lot because I have 10 different filter attributes, and these queries alone take 4 s, which is unacceptable.

I have B-tree indexes on all vital fields; here is the output of EXPLAIN ANALYZE:

HashAggregate  (cost=28309.30..28309.43 rows=13 width=4) (actual time=343.343..343.347 rows=14 loops=1)
->  Hash Right Join  (cost=24284.42..28074.04 rows=94103 width=4) (actual time=185.278..315.749 rows=115745 loops=1)
     Hash Cond: (t1.item_id = t0.id)
     ->  Seq Scan on core_item_attr_colors t1  (cost=0.00..1797.13 rows=108913 width=8) (actual time=0.006..18.387 rows=107175 loops=1)
     ->  Hash  (cost=23108.13..23108.13 rows=94103 width=4) (actual time=185.182..185.182 rows=93778 loops=1)
           Buckets: 16384  Batches: 1  Memory Usage: 3297kB
           ->  Seq Scan on core_item t0  (cost=0.00..23108.13 rows=94103 width=4) (actual time=0.020..153.334 rows=93778 loops=1)
                 Filter: (catalog_id = ANY ('{1,2,6,7,14,23,26,31,36,37,45,67,70,76,77,81,95,112,118,119,120,10,11,29,101,12,13,16,17,19,20,30,33,35,42,43,47,48,54,57,58,69,78,109,56,64,65,66,68,71,74,75,93,72,73,87,88,96,99,103,105,108,110}'::integer[]))
                 Rows Removed by Filter: 19677
Total runtime: 361.231 ms

As you can see, it doesn't use any indexes, but I've noticed that decreasing the number of categories eventually forces it to use the index:

 HashAggregate  (cost=18685.04..18685.17 rows=13 width=4) (actual time=166.760..166.764 rows=14 loops=1)
 ->  Hash Right Join  (cost=15515.08..18626.42 rows=23447 width=4) (actual time=56.499..156.865 rows=26501 loops=1)
     Hash Cond: (u2.item_id = u0.id)
     ->  Seq Scan on core_item_attr_colors u2  (cost=0.00..1797.13 rows=108913 width=8) (actual time=0.010..25.706 rows=107175 loops=1)
     ->  Hash  (cost=15221.99..15221.99 rows=23447 width=4) (actual time=56.444..56.444 rows=23099 loops=1)
           Buckets: 4096  Batches: 1  Memory Usage: 813kB
           ->  Bitmap Heap Scan on core_item u0  (cost=1058.03..15221.99 rows=23447 width=4) (actual time=9.732..45.643 rows=23099 loops=1)
                 Recheck Cond: (catalog_id = ANY ('{1,2,6,7,14,23,26,31,36,37,45,67,70,76,77,81,95,112,118,119}'::integer[]))
                 ->  Bitmap Index Scan on core_item_89ed0239  (cost=0.00..1052.17 rows=23447 width=0) (actual time=6.523..6.523 rows=23099 loops=1)
                       Index Cond: (catalog_id = ANY ('{1,2,6,7,14,23,26,31,36,37,45,67,70,76,77,81,95,112,118,119}'::integer[]))
Total runtime: 166.858 ms

I've tried replacing PostgreSQL with SQLite and got quite impressive results for the same query on exactly the same data set: it took less than 60 ms.

Here is my config file:

max_connections = 100
temp_buffers = 8MB
work_mem = 96MB
maintenance_work_mem = 512MB
effective_cache_size = 512MB

The server has 6G of RAM and an SSD disk.

What am I missing? I'd appreciate any suggestions on how to improve performance here.

UPDATE 1: shared_buffers = 1024MB, and it's PostgreSQL 9.3.

  • What are your shared_buffers? Also, which PostgreSQL version? (Commented Nov 6, 2015 at 13:09)

3 Answers


First, the LEFT JOIN is unnecessary unless you really want to get NULL values back (which is doubtful). So you can write this as:

SELECT DISTINCT T1.attribute_id
FROM item T0 JOIN
     item_attr_color T1
     ON T0.id = T1.item_id
WHERE T0.catalog_id IN (1, 2, 6, 7, 14, 23, 26, 31, 36, 37, 45, 67, 70, 76, 77, 81, 95, 112, 118, 119, 120, 10, 11, 29, 101, 12, 13, 16, 17, 19, 20, 30, 33, 35, 42, 43, 47, 48, 54, 57, 58, 69, 78, 109, 56, 64, 65, 66, 68, 71, 74, 75, 93, 72, 73, 87, 88, 96, 99, 103, 105, 108, 110);

Next, assuming that you have an attribute table, you can get rid of the DISTINCT and use a subquery:

select a.id
from attribute a
where exists (select 1
              from item_attr_color iac join
                   item i
                   on i.id = iac.item_id
              where i.catalog_id in ( . . .) and
                    iac.attribute_id = a.id
             );

Then, for this query, you want the following indexes: item(id, catalog_id), item_attr_color(attribute_id, item_id) and of course attribute(id).
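As a sketch, those indexes could be created like this (table and column names follow the question's simplified schema, so adjust them to the real one):

```sql
-- Sketch; names follow the question's simplified schema.
-- Covers the EXISTS probe: look up an item by id and check its
-- catalog_id without visiting the heap.
CREATE INDEX item_id_catalog_id_idx ON item (id, catalog_id);

-- Lets the subquery find all items carrying a given attribute directly.
CREATE INDEX item_attr_color_attr_item_idx ON item_attr_color (attribute_id, item_id);

-- attribute(id) is normally already covered by the primary key.
```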

This might help performance, by bringing in indexes and eliminating the processing for distinct.

It also might be worth trying the in version:

select a.id
from attribute a
where a.id in (select iac.attribute_id
                         from item_attr_color iac join
                              item i
                              on i.id = iac.item_id
                         where i.catalog_id in ( . . .)
                        );

The indexes for this query are: item(catalog_id, id), item_attr_color(item_id, attribute_id) and of course attribute(id).




Indexes

Having lots of indexes isn't necessarily good. Ideally you should have indexes where:

  • They are highly selective, i.e. a query on the index will find 5% or less of the table;
  • They are used by many queries

Each index has a cost for insert/update performance. So get rid of ones you don't need.

Also, you can get good results with composite, partial, and expression indexes. Just adding a single column index to everything in sight is rarely the best option.

Try to write queries that benefit from index-only scans, too. In this case I suspect an index on item_attr_color(item_id, attribute_id) might be beneficial.
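A sketch of that index (names from the question's simplified schema); with it, the join side of the query can be satisfied by an index-only scan, provided the table is vacuumed often enough for the visibility map to be mostly set:

```sql
-- Contains every column the query reads from item_attr_color,
-- so PostgreSQL can answer from the index alone (index-only scan)
-- instead of fetching each heap tuple.
CREATE INDEX item_attr_color_item_attr_idx ON item_attr_color (item_id, attribute_id);
```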

Tunables

If you have lots of RAM relative to the data size, and fast disks, lower random_page_cost. Lots. Try

 SET random_page_cost = 1.2

and re-running your query (in the same session, immediately afterwards).

Collation

If you're sorting strings and don't need localized sorting, use of COLLATE "C" can be very helpful.
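For instance, assuming a hypothetical text column item.title (not part of the question's schema), an index and sort using the "C" collation avoid locale-aware comparison costs:

```sql
-- Hypothetical column "title"; COLLATE "C" compares by byte order,
-- which is faster than locale-aware collation and also lets the
-- index support LIKE 'prefix%' matches.
CREATE INDEX item_title_c_idx ON item (title COLLATE "C");

SELECT title FROM item ORDER BY title COLLATE "C";
```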

= ANY or IN lists

Long IN or = ANY lists aren't very efficient. They're traversed sequentially. 100 entries probably isn't too bad, but if you really have lots, consider joining on a VALUES list instead.
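A sketch of the VALUES variant for the question's query (the ID list is truncated here; the full list from the question goes in its place):

```sql
-- Join against a VALUES list instead of a long IN (...) list;
-- the planner treats it like a small table and can hash-join it.
SELECT DISTINCT T1.attribute_id
FROM (VALUES (1), (2), (6), (7), (14), (23)) AS cats(catalog_id)  -- ...rest of the IDs
JOIN item T0 ON T0.catalog_id = cats.catalog_id
JOIN item_attr_color T1 ON T1.item_id = T0.id;
```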

You might also want to look into the intarray module's GIN index support. Note that the && ("overlaps") operator works on integer arrays, so this would mean storing the category memberships in a denormalized int[] column (the question's catalog_id is a plain integer); with such a column you could write:

WHERE T0.catalog_ids && ARRAY[1, 2, 6, 7, 14, 23, 26, 31, 36, 37, 45, 67, 70, 76, 77, 81, 95, 112, 118, 119, 120, 10, 11, 29, 101, 12, 13, 16, 17, 19, 20, 30, 33, 35, 42, 43, 47, 48, 54, 57, 58, 69, 78, 109, 56, 64, 65, 66, 68, 71, 74, 75, 93, 72, 73, 87, 88, 96, 99, 103, 105, 108, 110]

as an indexable operation.
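A sketch, assuming the schema were changed to store category memberships in a denormalized integer-array column (catalog_ids int[] is an assumed name, not in the question's schema):

```sql
-- Assumes a denormalized column: item.catalog_ids integer[].
CREATE EXTENSION IF NOT EXISTS intarray;
CREATE INDEX item_catalog_ids_gin ON item USING gin (catalog_ids gin__int_ops);

-- "&&" (overlaps) can then use the GIN index:
SELECT id FROM item WHERE catalog_ids && ARRAY[1, 2, 6, 7];
```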

SQLite

I'm not surprised SQLite does well. It's very fast at simple to moderate queries on read-only or nearly read-only workloads.

Comments


As well as query and index tuning, if you suspect that many of your queries will be driven by particular predicates (in your example, WHERE T0.catalog_id IN (1, 2, ... 108, 110)), then consider running CLUSTER on the table using the index on that column.

This gives you a better chance of the index being useful, and therefore chosen as part of the execution plan.
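For example, using the index name that appears in the question's EXPLAIN output:

```sql
-- Physically reorder the table by the catalog_id index.
-- CLUSTER takes an exclusive lock, and the ordering is not
-- maintained for later writes, so re-run it periodically.
CLUSTER core_item USING core_item_89ed0239;
ANALYZE core_item;
```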

