I have a table called "nodes" with roughly 1.7 million rows in my PostgreSQL db

=# \d nodes
            Table "public.nodes"
 Column |          Type          | Modifiers 
--------+------------------------+-----------
 id     | integer                | not null
 title  | character varying(256) | 
 score  | double precision       | 
Indexes:
    "nodes_pkey" PRIMARY KEY, btree (id)

I want to use information from that table to autocomplete a search field, showing the user the ten highest-scoring titles that match their input. So I used this query (here searching for all titles starting with "s"):

=# explain analyze select title,score from nodes where title ilike 's%' order by score desc; 
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Sort  (cost=64177.92..64581.38 rows=161385 width=25) (actual time=4930.334..5047.321 rows=161264 loops=1)
   Sort Key: score
   Sort Method:  external merge  Disk: 5712kB
   ->  Seq Scan on nodes  (cost=0.00..46630.50 rows=161385 width=25) (actual time=0.611..4464.413 rows=161264 loops=1)
         Filter: ((title)::text ~~* 's%'::text)
 Total runtime: 5260.791 ms
(6 rows)

This was much too slow to use for autocompletion. With some information from Using PostgreSQL in Web 2.0 Applications I was able to improve it with a special index:

=# create index title_idx on nodes using btree(lower(title) text_pattern_ops);
=# explain analyze select title,score from nodes where lower(title) like lower('s%') order by score desc limit 10;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=18122.41..18122.43 rows=10 width=25) (actual time=1324.703..1324.708 rows=10 loops=1)
   ->  Sort  (cost=18122.41..18144.60 rows=8876 width=25) (actual time=1324.700..1324.702 rows=10 loops=1)
         Sort Key: score
         Sort Method:  top-N heapsort  Memory: 17kB
         ->  Bitmap Heap Scan on nodes  (cost=243.53..17930.60 rows=8876 width=25) (actual time=96.124..1227.203 rows=161264 loops=1)
               Filter: (lower((title)::text) ~~ 's%'::text)
               ->  Bitmap Index Scan on title_idx  (cost=0.00..241.31 rows=8876 width=0) (actual time=90.059..90.059 rows=161264 loops=1)
                     Index Cond: ((lower((title)::text) ~>=~ 's'::text) AND (lower((title)::text) ~<~ 't'::text))
 Total runtime: 1325.085 ms
(9 rows)

So this gave me a factor-of-4 speedup. But can it be improved further? What if I want to search for '%s%' instead of 's%'? Do I have any chance of getting decent performance out of PostgreSQL in that case, too? Or would I be better off with a different solution (Lucene? Sphinx?) for my autocomplete feature?

4 Answers

You will need a text_pattern_ops index if you're not in the C locale.

See: index types.
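
For reference, a quick way to check which locale you are in (the title_idx created in the question is already such a text_pattern_ops index):

-- If this shows anything other than 'C', a plain btree index cannot
-- serve LIKE prefix searches and text_pattern_ops is needed.
SHOW lc_collate;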

1 Comment

text_pattern_ops is currently mentioned on this page btw: postgresql.org/docs/17/indexes-opclass.html

Tips for further investigation:

  • Partition the table on the title key. This makes the lists that PostgreSQL needs to work with smaller.

  • Give PostgreSQL more memory so the cache hit rate is above 98%. This table will take about 0.5 GB; 2 GB should be no problem nowadays. Make sure statistics collection is enabled, and read up on the pg_stats views (a query to check the hit rate follows this list).

  • Make a second table with a reduced substring of the title, e.g. 12 characters, so the complete table fits in fewer database blocks. An index on a substring may also work, but requires careful querying.

  • The longer the substring, the faster the query will run. Create a separate table for small substrings, and store as the value the top ten (or however many) choices you would want to show; see the sketch after this list. There are about 20,000 combinations of 1-, 2-, and 3-character strings.

  • You can use the same idea if you want to have %abc% queries, but at that point switching to Lucene probably makes sense.
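
On the second tip, one way to check the cache hit rate is to query the standard statistics views (a sketch; shared_buffers is the setting to raise if the rate is low):

-- Fraction of table reads served from shared buffers rather than disk.
SELECT sum(heap_blks_hit)::float
       / nullif(sum(heap_blks_hit + heap_blks_read), 0) AS cache_hit_rate
  FROM pg_statio_user_tables;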
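
And a rough sketch of the small-substrings lookup table from the fourth tip (the table and column names are invented, the table would have to be refreshed as nodes changes, and titles containing LIKE wildcards are ignored for brevity):

-- Precompute the top-ten titles for every 1-3 character prefix.
CREATE TABLE prefix_top10 (
    prefix text PRIMARY KEY,
    titles text[] NOT NULL
);

INSERT INTO prefix_top10 (prefix, titles)
SELECT p.prefix,
       ARRAY(SELECT n.title
               FROM nodes n
              WHERE lower(n.title) LIKE p.prefix || '%'
              ORDER BY n.score DESC
              LIMIT 10)
  FROM (SELECT DISTINCT substr(lower(title), 1, len) AS prefix
          FROM nodes, generate_series(1, 3) AS len
         WHERE length(title) >= len) p;

-- Autocomplete for short inputs becomes a single primary-key lookup:
SELECT titles FROM prefix_top10 WHERE prefix = lower('s');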

You're obviously not interested in 150,000+ results, so you should limit them:

select title,score
  from nodes
  where title ilike 's%'
  order by score desc
  limit 10;

You can also consider creating a functional index and querying with ">=" and "<":

create index nodes_title_lower_idx on nodes (lower(title));
select title,score
  from nodes
  where lower(title)>='s' and lower(title)<'t'
  order by score desc
  limit 10;

You should also create an index on score, which will help in the ILIKE '%s%' case.
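
For instance (a sketch; the index name is arbitrary, and whether the planner actually uses it depends on how common the pattern is):

CREATE INDEX nodes_score_idx ON nodes (score DESC);

-- With an index matching the ORDER BY, the planner can walk the scores
-- from the top and stop as soon as ten rows pass the filter, instead of
-- scanning and sorting every match.
SELECT title, score
  FROM nodes
 WHERE title ILIKE '%s%'
 ORDER BY score DESC
 LIMIT 10;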

Simultaneous prefix + suffix matching with LIKE '%abc%' can be sped up with GIN + pg_trgm

Usage:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE TABLE "mytable" ("col1" TEXT);
CREATE INDEX "mytable_col1_gin" ON "mytable" USING gin("col1" gin_trgm_ops);
EXPLAIN SELECT * FROM "mytable" WHERE "col1" LIKE '%abc%';

which produces:

                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Bitmap Heap Scan on mytable  (cost=15.10..104.10 rows=400 width=72)
   Recheck Cond: (col1 ~~ '%abc%'::text)
   ->  Bitmap Index Scan on mytable_col1_gin  (cost=0.00..15.00 rows=400 width=0)
         Index Cond: (col1 ~~ '%abc%'::text)

which shows that the trigram index is used (a Bitmap Index Scan rather than a sequential scan), i.e. the query is sped up.

Notes:

A fun way to test this out is to generate a massive test database, e.g. with a generate_series insert along these lines (a sketch; the row count and repeat factor are arbitrary knobs for reaching the size you want):
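
INSERT INTO "mytable" ("col1")
SELECT repeat(md5(i::text), 32)               -- ~1 kB of pseudo-random text per row
  FROM generate_series(1, 10000000) AS s(i);  -- ~10 million rows, roughly 10 GB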

and then see if our queries are actually faster. I recommend at least 10 GB of text; below that, the indexing speedup was not very clear on my system.

Tested on Ubuntu 24.04, PostgreSQL 16.6.
