I have a table with English sentences. Given a sentence which may contain an additional word, or a distorted word, can I find the closest sentence in the table using Postgres' Text-Search capabilities?

to_tsvector('a b c') @@ plainto_tsquery('a b') returns true

to_tsvector('a b') @@ plainto_tsquery('a b c') returns false

I would like scenario 2 to return true as well.
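
The second case returns false because plainto_tsquery joins its terms with AND (&), so every query term must be present in the document. One possible workaround, shown here only as a minimal sketch, is to rewrite the query so the terms are joined with OR (|) and then rank the matches with ts_rank. The table sentences and column body are hypothetical names, the english configuration is an assumption, and the replace() rewrite assumes that no lexeme contains an ampersand.

-- Hypothetical table/column names (sentences.body); substitute your own.
-- Turn the AND query produced by plainto_tsquery into an OR query.
WITH q AS (
    SELECT replace(
               plainto_tsquery('english', 'electode paste composition')::text,
               '&', '|'
           )::tsquery AS query
)
SELECT s.body,
       ts_rank(to_tsvector('english', s.body), q.query) AS rank
FROM sentences AS s, q
WHERE to_tsvector('english', s.body) @@ q.query  -- true if any term matches
ORDER BY rank DESC                               -- best-matching rows first
LIMIT 5;

A misspelled word such as 'electode' still matches nothing on its own, but the row can be found through the remaining terms. An expression index such as CREATE INDEX ON sentences USING gin (to_tsvector('english', body)) can support the @@ condition.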

Notes:

  1. The sentences may be dozens of words long. I'm looking for an efficient solution.
  2. Other text search engines such as Elastic/Solr will successfully return the closest result.

More information regarding the performance of the trigram index:

EXPLAIN (ANALYSE, BUFFERS)
SELECT
    similarity(title, 'electode paste composition') as sml,
    title
FROM
    table
WHERE
    title % 'electode paste composition'
ORDER BY
    sml DESC;

returns:

Gather Merge  (cost=1880112.22..1902381.94 rows=190870 width=93) (actual time=36355.303..36356.143 rows=5 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=407649
  ->  Sort  (cost=1879112.20..1879350.78 rows=95435 width=93) (actual time=36344.180..36344.180 rows=2 loops=3)
        Sort Key: (similarity(title, 'electode paste composition'::text)) DESC
        Sort Method: quicksort  Memory: 25kB
        Worker 0:  Sort Method: quicksort  Memory: 25kB
        Worker 1:  Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=407649
        ->  Parallel Bitmap Heap Scan on table  (cost=2759.10..1866325.66 rows=95435 width=93) (actual time=35940.284..36344.141 rows=2 loops=3)
              Recheck Cond: (title % 'electode paste composition'::text)
              Rows Removed by Index Recheck: 14904
              Heap Blocks: exact=16199
              Buffers: shared hit=407635
              ->  Bitmap Index Scan on title_trgm  (cost=0.00..2701.84 rows=229045 width=0) (actual time=35543.907..35543.907 rows=44716 loops=1)
                    Index Cond: (title % 'electode paste composition'::text)
                    Buffers: shared hit=362988
Planning Time: 0.084 ms
Execution Time: 36356.187 ms

The same query using a tsquery takes less than 2.5 seconds.

  • If you want to refer to how other products do it, you should include enough information that we can actually figure out what you mean. Commented Feb 11, 2020 at 22:40

1 Answer

Assuming that your words are longer than one character, I'd recommend trigram indexes:

CREATE EXTENSION pg_trgm;

CREATE INDEX ON atable USING gin (textcol gin_trgm_ops);

SELECT * FROM atable WHERE textcol % 'search string';

% is the similarity operator.
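
To make the operator's behaviour more concrete, here is a small sketch using functions and settings that ship with pg_trgm. The threshold value 0.5 and the LIMIT are illustrative assumptions, not recommendations; atable and textcol are the placeholder names used above.

-- Compare two strings directly; the result is a number between 0 and 1.
SELECT similarity('electode paste composition', 'electrode paste composition');

-- Inspect the trigrams that pg_trgm extracts from a string.
SELECT show_trgm('paste');

-- % is true when similarity() exceeds pg_trgm.similarity_threshold
-- (default 0.3). Raising it for the session makes % more selective.
SET pg_trgm.similarity_threshold = 0.5;

SELECT textcol, similarity(textcol, 'search string') AS sml
FROM atable
WHERE textcol % 'search string'
ORDER BY sml DESC
LIMIT 10;

A higher threshold lets fewer candidate rows through to the recheck and the sort, at the cost of missing more distant matches.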

4 Comments

Unfortunately, trigram search performance is not good enough.
Seems unlikely. Perhaps your search strings are too short or too general. As I wrote, words consisting of a single letter won't work, but then that is not the real-world use case, I'd expect. Hard to say anything more without - say - the EXPLAIN (ANALYZE, BUFFERS) output of such an execution that is "too slow".
I've added the info you requested to the description, please take a look.
Try to VACUUM the table to optimize the index, although I don't expect that to have a big impact. Of course it is slower than a full text search; it is doing more work. If the trigram index is not good enough, you are out of options.
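
The VACUUM suggestion in the last comment could be tried roughly as follows. This is only a sketch: atable is the placeholder table name from the answer, the index name title_trgm is taken from the plan in the question, and whether this helps for the query above is not established.

-- VACUUM also moves GIN pending-list entries into the main index structure.
VACUUM (ANALYZE) atable;

-- Alternatively, flush only the pending list of one GIN index
-- (available in PostgreSQL 9.6 and later).
SELECT gin_clean_pending_list('title_trgm'::regclass);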