While working on some performance tuning, I came across this posting from Instagram's engineering team:
On some of our tables, we need to index strings (for example, 64 character base64 tokens) that are quite long, and creating an index on those strings ends up duplicating a lot of data. For these, Postgres’ functional index feature can be very helpful:
CREATE INDEX CONCURRENTLY on tokens (substr(token), 0, 8)
While there will be multiple rows that match that prefix, having Postgres match those prefixes and then filter down is quick, and the resulting index was 1/10th the size it would have been had we indexed the entire string.
This looked like a good idea, so I tried it -- we have a lot of items that are keyed by a checksum.
Our results were not good. I'm wondering if anyone else has had luck.
First off, the blog post looks wrong:
CREATE INDEX CONCURRENTLY on tokens (substr(token), 0, 8)
Shouldn't that be...
CREATE INDEX CONCURRENTLY on tokens (substr(token, 0, 8));
One of our fields is based on a 40-character hash. So I tried:
CREATE INDEX __speed_idx_test_8 on foo (substr(hash, 0, 8));
The query planner wouldn't use it.
So I tried:
CREATE INDEX __speed_idx_test_20 on foo (substr(hash, 0, 20));
The query planner still wouldn't use it.
Then I tried:
CREATE INDEX __speed_idx_test_40 on foo (substr(hash, 0, 40));
Still, the planner wouldn't use it.
What if we try disabling seq scans?
set enable_seqscan=false;
Nope.
Let's go back to our original index.
CREATE INDEX __speed_idx_original on foo (hash);
set enable_seqscan = true;
And that works.
Then I thought -- maybe I need to use a function in the query in order to use a function index. So I tried changing the query:
old:
select * from foo where hash = '%s';
new:
select * from foo where substr(hash,0,8) = '%s' and hash = '%s';
And that worked.
Does anyone know if it is possible to make this work without adding in an extra search condition? I'd rather not do that, but looking at the filesize and speed improvements... wow.
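As far as I can tell, the planner only considers an expression index when the query contains that same expression, so the extra condition cannot simply be dropped. One way to keep it out of application code is to hide it inside a function. This is a minimal sketch, not a tested solution: foo_by_hash and p_hash are hypothetical names, and it assumes hash can be compared against a text parameter.

```sql
-- Hypothetical helper: callers pass only the full hash.
CREATE FUNCTION foo_by_hash(p_hash text) RETURNS SETOF foo AS $$
    SELECT *
    FROM foo
    WHERE substr(hash, 0, 8) = substr(p_hash, 0, 8)  -- same expression as __speed_idx_test_8
      AND hash = p_hash;                              -- discards other rows sharing the prefix
$$ LANGUAGE sql STABLE;

-- Usage: the prefix condition is no longer visible to the caller.
SELECT * FROM foo_by_hash('eae1d1728963f107fa7d8136bcf7c72572896e1d');
```

Deriving the prefix from the parameter with the same substr call also guarantees that both sides of the comparison have the same length.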
And if you're wondering what the 'explain analyze' output was...
-- seq scan
Seq Scan on foo (cost=10000000000.00..10000073130.77 rows=1 width=1921) (actual time=373.785..1563.551 rows=1 loops=1)
Filter: (hash = 'eae1d1728963f107fa7d8136bcf7c72572896e1d'::bpchar)
Rows Removed by Filter: 450252
Total runtime: 1563.687 ms
-- index scan
Index Scan using __speed_idx_original on foo (cost=0.00..16.53 rows=1 width=1920) (actual time=0.060..0.061 rows=1 loops=1)
Index Cond: (hash = 'eae1d1728963f107fa7d8136bcf7c72572896e1d'::bpchar)
Total runtime: 1.501 ms
-- index scan with substring function
Index Scan using __speed_idx_test_8 on foo (cost=0.00..16.37 rows=1 width=1913) (actual time=0.134..0.134 rows=0 loops=1)
Index Cond: (substr((hash)::text, 0, 8) = 'eae1d172'::text)
Filter: (hash = 'eae1d1728963f107fa7d8136bcf7c72572896e1d'::bpchar)
Total runtime: 0.216 ms
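One detail in that last plan may be worth checking: it reports rows=0 even though the full hash is in the table. Postgres string positions start at 1, so substr(hash, 0, 8) returns only the first 7 characters, and an 8-character prefix can never equal it. A minimal sketch of the check, using the hash from the plans above (the index name __speed_idx_test_8b is hypothetical):

```sql
-- Positions are 1-based: a start of 0 spends one unit of the length
-- on a position before the string begins.
SELECT substr('eae1d1728963f107fa7d8136bcf7c72572896e1d', 0, 8);  -- 'eae1d17'  (7 characters)
SELECT substr('eae1d1728963f107fa7d8136bcf7c72572896e1d', 1, 8);  -- 'eae1d172' (8 characters)

-- Hypothetical index name; starting at position 1 gives an 8-character prefix.
CREATE INDEX __speed_idx_test_8b on foo (substr(hash, 1, 8));

-- The query still has to repeat the indexed expression exactly.
SELECT * FROM foo
WHERE substr(hash, 1, 8) = 'eae1d172'
  AND hash = 'eae1d1728963f107fa7d8136bcf7c72572896e1d';
```

With the start position at 1, the prefix literal and the indexed expression have the same length, so the index condition can actually match the row before the filter on the full hash runs.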