Postgres Proper Index for Sorting and Join

Question

I have a simple schema and query, but I am seeing consistently awful performance with certain parameters.

Schema:

CREATE TABLE locations (
  id integer NOT NULL,
  barcode_id integer NOT NULL
);

CREATE TABLE barcodes (
  id integer NOT NULL,
  value citext NOT NULL
);

ALTER TABLE ONLY locations ADD CONSTRAINT locations_pkey PRIMARY KEY (id);
ALTER TABLE ONLY barcodes ADD CONSTRAINT barcodes_pkey PRIMARY KEY (id);
ALTER TABLE ONLY locations ADD CONSTRAINT fk_locations_barcodes FOREIGN KEY (barcode_id) REFERENCES barcodes(id);

CREATE INDEX index_barcodes_on_value ON barcodes (value);
CREATE INDEX index_locations_on_barcode_id ON locations (barcode_id);

Query:

EXPLAIN ANALYZE
SELECT *
FROM locations
JOIN barcodes ON locations.barcode_id = barcodes.id
ORDER BY barcodes.value ASC
LIMIT 50;

Analysis:

Limit  (cost=0.71..3564.01 rows=50 width=34) (actual time=0.043..683.025 rows=50 loops=1)
  ->  Nested Loop  (cost=0.71..4090955.00 rows=57404 width=34) (actual time=0.043..683.017 rows=50 loops=1)
        ->  Index Scan using index_barcodes_on_value on barcodes  (cost=0.42..26865.99 rows=496422 width=15) (actual time=0.023..218.775 rows=372138 loops=1)
        ->  Index Scan using index_locations_on_barcode_id on locations  (cost=0.29..5.32 rows=287 width=8) (actual time=0.001..0.001 rows=0 loops=372138)
              Index Cond: (barcode_id = barcodes.id)
Planning time: 0.167 ms
Execution time: 683.078 ms

500+ ms for the number of entries in my schema (500,000 barcodes and 60,000 locations) doesn't make sense. Can I do anything to improve the performance?

Note:

Even stranger, the execution time depends on the data. While drafting this question I tried to include seeded random data, but the query performs well against the seeded data:

Seed:

INSERT INTO barcodes (id, value) SELECT seed.id, gen_random_uuid() FROM generate_series(1,500000) AS seed(id);
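-- Draw barcode_id from 1..500000 so that every row satisfies the foreign key (a rounded 0 would not).
INSERT INTO locations (id, barcode_id) SELECT seed.id, 1 + floor(random() * 500000)::integer FROM generate_series(1,60000) AS seed(id);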

Analysis:

Limit  (cost=0.71..3602.63 rows=50 width=86) (actual time=0.089..1.123 rows=50 loops=1)
  ->  Nested Loop  (cost=0.71..4330662.42 rows=60116 width=86) (actual time=0.088..1.115 rows=50 loops=1)
        ->  Index Scan using index_barcodes_on_value on barcodes  (cost=0.42..44972.42 rows=500000 width=41) (actual time=0.006..0.319 rows=376 loops=1)
        ->  Index Scan using index_locations_on_barcode_id on locations  (cost=0.29..5.56 rows=301 width=8) (actual time=0.002..0.002 rows=0 loops=376)
              Index Cond: (barcode_id = barcodes.id)
Planning time: 0.213 ms
Execution time: 1.152 ms

Edit:

Analysis of the tables:

ANALYZE VERBOSE barcodes;
INFO:  analyzing "public.barcodes"
INFO:  "barcodes": scanned 2760 of 2760 pages, containing 496157 live rows and 0 dead rows; 30000 rows in sample, 496157 estimated total rows
ANALYZE
Time: 62.937 ms

ANALYZE VERBOSE locations;
INFO:  analyzing "public.locations"
INFO:  "locations": scanned 254 of 254 pages, containing 57394 live rows and 0 dead rows; 30000 rows in sample, 57394 estimated total rows
ANALYZE
Time: 21.447 ms

@Jasen I'm reading up on CLUSTER, which I've never used. I tried running CLUSTER barcodes USING barcodes_pkey; and CLUSTER locations USING index_locations_on_barcode_id; but it didn't seem to help. Should I be using it differently?
– Stussa, Commented Mar 29, 2017 at 4:22
@Jasen Also, it sounds like CLUSTER isn't permanent and might need to be re-run?
– Stussa, Commented Mar 29, 2017 at 4:29
Stupid question: did you run ANALYZE on both tables before you ran your query?
– Laurenz Albe, Commented Mar 29, 2017 at 8:18

Laurenz Albe · Accepted Answer · 2017-03-29 19:20:47Z

2

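The problem is that the barcodes with low values don't have matches in locations, which PostgreSQL cannot know. Its plan is to fetch the barcodes in the correct output order via the index and join each one against locations until it has found 50 matches. That turns out to be much worse than it expected: the slow plan in the question reads 372,138 barcodes (and does as many index lookups on locations) before it reaches the 50th match, whereas with the seeded data 376 barcodes were enough.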
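
To check whether the data really is skewed this way, one could count how many barcodes sort ahead of the first barcode that has a location. This is only a diagnostic sketch against the tables from the question; the column alias is made up for illustration.

```sql
-- Count the barcodes that sort before the first barcode referenced by a location.
-- A large number means the ordered index scan has to skip many unmatched barcodes.
SELECT count(*) AS barcodes_before_first_match
FROM barcodes b
WHERE b.value < (
    SELECT b2.value
    FROM barcodes b2
    WHERE EXISTS (SELECT 1 FROM locations l WHERE l.barcode_id = b2.id)
    ORDER BY b2.value
    LIMIT 1
);
```

With evenly distributed data the count should be small; a count in the hundreds of thousands would match the slow plan above.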

I would ANALYZE the tables and

DROP INDEX index_barcodes_on_value;

That should keep PostgreSQL from choosing that plan.
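
Since DROP INDEX is transactional in PostgreSQL, the effect can be tried out before committing to it. A minimal sketch, assuming it is run on a test system or in a quiet period, because the DROP INDEX holds an exclusive lock on barcodes until the transaction ends:

```sql
BEGIN;

-- Hide the index from the planner for the duration of this transaction only.
DROP INDEX index_barcodes_on_value;

-- See which plan PostgreSQL picks without the index, and how long it takes.
EXPLAIN ANALYZE
SELECT *
FROM locations
JOIN barcodes ON locations.barcode_id = barcodes.id
ORDER BY barcodes.value ASC
LIMIT 50;

-- Undo the DROP INDEX; the index is back exactly as it was.
ROLLBACK;
```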

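I don't know what plan PostgreSQL will choose then. A nested loop needs indexes on the join columns, and those already exist: barcodes_pkey on barcodes(id) and index_locations_on_barcode_id on locations(barcode_id). An extra index on locations(id) would only duplicate locations_pkey.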


1 Comment

bobflux Over a year ago

Hehe, that was exactly what I was about to reply, so I'm upvoting. Optimizing this is going to be tricky, as removing the index will force a huge, slow sort, unless the number of matching rows in both tables is very small. Materializing the "value" column in the "locations" table, along with an index on it, would make the query very fast, but it denormalizes and fattens the table, so it should only be considered if this is a very important query.
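
A rough sketch of that denormalization, for illustration only. The column name barcode_value and the index name are assumptions, not part of the original schema, and keeping the copy in sync when barcodes.value changes (for example with a trigger) is not shown.

```sql
-- Copy the sort key into locations (hypothetical column name).
ALTER TABLE locations ADD COLUMN barcode_value citext;

UPDATE locations
SET barcode_value = barcodes.value
FROM barcodes
WHERE barcodes.id = locations.barcode_id;

ALTER TABLE locations ALTER COLUMN barcode_value SET NOT NULL;

-- Index the copied column so the first 50 locations can be read in sorted order.
CREATE INDEX index_locations_on_barcode_value ON locations (barcode_value);

-- Sort on the local column; each location then needs one primary key lookup in barcodes.
EXPLAIN ANALYZE
SELECT *
FROM locations
JOIN barcodes ON locations.barcode_id = barcodes.id
ORDER BY locations.barcode_value ASC
LIMIT 50;
```

Because every location has a barcode, an ordered scan of the new index should find its 50 rows after reading about 50 index entries, regardless of how the barcode values are distributed.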
