
I got a strange result searching for an expression like pro-physik.de with tsquery.

If I query for pro-physik:* with tsquery, I expect to get all entries starting with pro-physik. Unfortunately, the entries containing pro-physik.de are missing.

Here are 2 examples to demonstrate the problem:

Query 1:

select 
    to_tsvector('simple', 'pro-physik.de') @@ 
    to_tsquery('simple', 'pro-physik:*') = true

Result 1: false (should be true)

Query 2:

select 
    to_tsvector('simple', 'pro-physik.de') @@
    to_tsquery('simple', 'pro-p:*') = true

Result 2: true

Does anybody have an idea how I could solve this problem?

1 Answer

The core of the problem is that the parser will parse pro-physik.de as a hostname:

SELECT alias, token FROM ts_debug('simple', 'pro-physik.de');

 alias |     token
-------+---------------
 host  | pro-physik.de
(1 row)

Compare this:

SELECT alias, token FROM ts_debug('simple', 'pro-physik-de');

      alias      |     token
-----------------+---------------
 asciihword      | pro-physik-de
 hword_asciipart | pro
 blank           | -
 hword_asciipart | physik
 blank           | -
 hword_asciipart | de
(6 rows)

Neither pro-physik nor pro-p is parsed as a hostname, so to_tsquery splits each of them into the hyphenated word plus its parts:

SELECT to_tsquery('simple', 'pro-physik:*');
              to_tsquery
---------------------------------------
 'pro-physik':* & 'pro':* & 'physik':*
(1 row)

SELECT to_tsquery('simple', 'pro-p:*');
         to_tsquery
-----------------------------
 'pro-p':* & 'pro':* & 'p':*
(1 row)

The first tsquery does not match because physik is not a prefix of the single lexeme pro-physik.de. The second one matches because pro-p, pro and p are all prefixes of that lexeme.
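To see this term by term, the following sketch first shows the stored lexeme and then tests each prefix term of the first query on its own. It builds the tsquery with a direct cast so that the parser does not split the term again; that cast is only an inspection device chosen here, not part of the workaround.

```sql
-- The whole string is stored as one host lexeme.
SELECT to_tsvector('simple', 'pro-physik.de');

-- Test each prefix term separately. quote_literal() wraps the term in
-- single quotes, and the ::tsquery cast takes it as-is (no parsing).
SELECT term,
       to_tsvector('simple', 'pro-physik.de')
           @@ (quote_literal(term) || ':*')::tsquery AS matches
FROM (VALUES ('pro-physik'), ('pro'), ('physik')) AS t(term);
```

Only the physik row should come back false, and since the terms are combined with &, that one failure is enough to make the whole query fail.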

As a workaround, use full text search like this:

select 
   to_tsvector('simple', replace('pro-physik.de', '.', ' ')) @@ 
   to_tsquery('simple', replace('pro-physik:*', '.', ' '))
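
A quick way to check why the workaround helps is to run the replaced string through ts_debug and to_tsvector again. This is a minimal sketch using only the string from the question:

```sql
-- With the dot replaced by a space, the parser no longer emits a host token.
SELECT alias, token
FROM ts_debug('simple', replace('pro-physik.de', '.', ' '));

-- The tsvector now holds the hyphenated word and its parts as separate lexemes.
SELECT to_tsvector('simple', replace('pro-physik.de', '.', ' '));

-- Each prefix term of the query therefore has a lexeme it can match.
SELECT to_tsvector('simple', replace('pro-physik.de', '.', ' '))
       @@ to_tsquery('simple', 'pro-physik:*') AS matches;
```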

3 Comments

The main problem seems to be not the dot itself but the ts parser's behavior of not splitting hostnames. The same thing happens with e-mail addresses:
Do you or anyone else have an idea how to configure full text search so that it does not treat hostnames, e-mail addresses and the like as special tokens?
You cannot do that unless you want to write a new parser (in C). I told you the workaround: replace the dot with a space, then the parser will find no hostnames. Just apply replace(<string>, '.', ' ') to the strings before you use them in full text search.
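
Applied to a table, that advice could look like the sketch below. The table docs, its columns and the index name are hypothetical and exist only for illustration. The point is that the same replace() expression appears in the index and in the query.

```sql
-- Hypothetical table and sample rows, for illustration only.
CREATE TABLE docs (id serial PRIMARY KEY, body text);
INSERT INTO docs (body) VALUES ('pro-physik.de'), ('example.org');

-- Expression index over the dot-stripped text.
CREATE INDEX docs_body_fts ON docs
    USING gin (to_tsvector('simple', replace(body, '.', ' ')));

-- The WHERE clause repeats the indexed expression so the planner can use the index.
SELECT id, body
FROM docs
WHERE to_tsvector('simple', replace(body, '.', ' '))
      @@ to_tsquery('simple', replace('pro-physik:*', '.', ' '));
```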
