Postgres Text Search with Additional Words/Tokens

Question

I have a table with English sentences. Given a sentence which may contain an additional word, or a distorted word, can I find the closest sentence in the table using Postgres' Text-Search capabilities?

to_tsvector('a b c') @@ plainto_tsquery('a b') returns true

to_tsvector('a b') @@ plainto_tsquery('a b c') returns false

I would like scenario 2 to return true as well.
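
The asymmetry between the two scenarios follows from how plainto_tsquery builds its query: it joins every term with & (AND), so each word of the query must be present in the document. A minimal sketch with longer words; the sample strings and the explicit 'english' configuration are assumptions made for illustration:

```sql
-- plainto_tsquery joins all normalized terms with & (AND).
SELECT plainto_tsquery('english', 'electrode paste composition');

-- The document contains every query term, so this matches.
SELECT to_tsvector('english', 'electrode paste composition')
       @@ plainto_tsquery('english', 'electrode paste');

-- The query has a term the document lacks, so this does not match.
SELECT to_tsvector('english', 'electrode paste')
       @@ plainto_tsquery('english', 'electrode paste composition');
```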

Notes:

The sentences may be dozens of words long. I'm looking for an efficient solution.
Other text search engines such as Elastic/Solr will successfully return the closest result.

More information regarding the performance of the trigram index:

EXPLAIN (ANALYSE, BUFFERS)
SELECT
    similarity(title, 'electode paste composition') as sml,
    title
FROM
    table
WHERE
    title % 'electode paste composition'
ORDER BY
    sml DESC;

returns:

Gather Merge  (cost=1880112.22..1902381.94 rows=190870 width=93) (actual time=36355.303..36356.143 rows=5 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=407649
  ->  Sort  (cost=1879112.20..1879350.78 rows=95435 width=93) (actual time=36344.180..36344.180 rows=2 loops=3)
        Sort Key: (similarity(title, 'electode paste composition'::text)) DESC
        Sort Method: quicksort  Memory: 25kB
        Worker 0:  Sort Method: quicksort  Memory: 25kB
        Worker 1:  Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=407649
        ->  Parallel Bitmap Heap Scan on table  (cost=2759.10..1866325.66 rows=95435 width=93) (actual time=35940.284..36344.141 rows=2 loops=3)
              Recheck Cond: (title % 'electode paste composition'::text)
              Rows Removed by Index Recheck: 14904
              Heap Blocks: exact=16199
              Buffers: shared hit=407635
              ->  Bitmap Index Scan on title_trgm  (cost=0.00..2701.84 rows=229045 width=0) (actual time=35543.907..35543.907 rows=44716 loops=1)
                    Index Cond: (title % 'electode paste composition'::text)
                    Buffers: shared hit=362988
Planning Time: 0.084 ms
Execution Time: 36356.187 ms

The same query using a tsquery takes less than 2.5 seconds.

If you want to refer to how other products do it, you should include enough information that we can actually figure out what you mean.
– jjanes, Commented Feb 11, 2020 at 22:40

Laurenz Albe · Accepted Answer · 2020-02-11 16:38:51Z

1

Assuming that your words are longer than one character, I'd recommend trigram indexes:

CREATE EXTENSION pg_trgm;

CREATE INDEX ON atable USING gin (textcol gin_trgm_ops);

SELECT * FROM atable WHERE textcol % 'search string';

% is the similarity operator.
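
To see what % actually compares, pg_trgm exposes the underlying functions. The sketch below assumes the pg_trgm extension is already installed; the sample strings are illustrative, and the threshold value 0.5 is an arbitrary example rather than a recommendation:

```sql
-- Trigrams extracted from a string; similarity is computed over these.
SELECT show_trgm('paste');

-- Similarity score between 0 and 1. The % operator is true when the
-- score exceeds pg_trgm.similarity_threshold (default 0.3).
SELECT similarity('electrode paste composition',
                  'electode paste composition');

-- Raise the threshold for the current session so that % accepts
-- fewer candidate rows.
SET pg_trgm.similarity_threshold = 0.5;
SHOW pg_trgm.similarity_threshold;
```

A higher threshold makes % stricter, which usually means fewer rows to recheck and sort, at the cost of missing more distant matches.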


4 Comments

Tomer Over a year ago

Unfortunately, trigram search performance is not good enough.

Laurenz Albe Over a year ago

Seems unlikely. Perhaps your search strings are too short or too general. As I wrote, words consisting of a single letter won't work, but I'd expect that is not the real-world use case. It is hard to say more without, say, the EXPLAIN (ANALYZE, BUFFERS) output of an execution that is "too slow".

Tomer Over a year ago

I've added the info you requested to the description, please take a look.

Laurenz Albe Over a year ago

Try to VACUUM the table to optimize the index, although I don't expect that to have a big impact. Of course it is slower than a full text search; it is doing more work. If the trigram index is not good enough, you are out of options.
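
The VACUUM suggestion could look like the following sketch. It reuses the atable name from the answer; gin_clean_pending_list is a built-in function in PostgreSQL 9.6 and later, and the index name atable_textcol_idx is an assumption based on the default name PostgreSQL would give the unnamed index created in the answer:

```sql
-- VACUUM moves entries from the GIN index's pending list into the
-- main index structure; ANALYZE refreshes planner statistics.
VACUUM (ANALYZE) atable;

-- Alternatively, flush only the GIN pending list without a full VACUUM.
-- The index name is assumed; check \d atable for the actual name.
SELECT gin_clean_pending_list('atable_textcol_idx'::regclass);
```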
