
I'm having problems correctly expressing a regexp for the ElasticSearch Regexp Filter. I'm trying to match on anything in "info-for/media" in the url field e.g. http://mydomain.co.uk/info-for/media/press-release-1. To try and get the regex right I'm using match_all for now, but this will eventually be match_phrase with the user's query string.

POST to localhost:9200/_search

{
    "query": {
        "match_all": {},
        "filtered": {
            "filter": {
                "regexp": {
                    "url": ".*info-for/media.*"
                }
            }
        }
    }
}

This returns 0 hits, but does parse correctly. .*info.* does get results containing the url, but unfortunately is too broad, e.g. matching any urls containing "information". As soon as I add the hyphen in "info-for" back in, I get 0 results again. No matter what combination of escape characters I try, I either get a parse exception, or no matches. Can anybody help explain what I'm doing wrong?

1 Answer


First, to the extent possible, try to never use regular expressions or wildcards that don't have a prefix. The way a search for .*foo.* is done, is that every single term in the index's dictionary is matched against the pattern, which in turn is constructed into an OR-query of the matching terms. This is O(n) in the number of unique terms in your corpus, with a subsequent search that is quite expensive as well.

This article has some more details about that: https://www.found.no/foundation/elasticsearch-from-the-bottom-up/

Secondly, your url is probably tokenized in a way that makes "info-for" and "media" separate terms in your index. Thus, there is no info-for/media-term in the dictionary for the regexp to match.
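You can check which terms your mapping actually produces with the _analyze API. With the default standard analyzer, for example, the hyphen and the slashes act as token boundaries (URL-encode the text parameter as needed):

```
GET localhost:9200/_analyze?analyzer=standard&text=http://mydomain.co.uk/info-for/media/press-release-1
```

This returns separate tokens such as info, for and media rather than a single info-for/media term, which is why the regexp never matches.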

What you probably want to do is to index the path and the domain separately, with a path_hierarchy-tokenizer to generate the terms.

Here is an example that demonstrates how the tokens are generated: https://www.found.no/play/gist/ecf511d4102a806f350b#analysis

That is, /foo/bar/baz generates the tokens /foo/bar/baz, /foo/bar and /foo, and the domain foo.example.com is tokenized to foo.example.com, example.com and com.
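A sketch of such a mapping, using ES 1.x syntax with hypothetical names (myindex, page, path_analyzer) — indexing with path_hierarchy while searching with keyword so the query string is matched as a single term:

```json
PUT localhost:9200/myindex
{
    "settings": {
        "analysis": {
            "tokenizer": {
                "path_tokenizer": {
                    "type": "path_hierarchy"
                }
            },
            "analyzer": {
                "path_analyzer": {
                    "type": "custom",
                    "tokenizer": "path_tokenizer"
                }
            }
        }
    },
    "mappings": {
        "page": {
            "properties": {
                "path": {
                    "type": "string",
                    "index_analyzer": "path_analyzer",
                    "search_analyzer": "keyword"
                }
            }
        }
    }
}
```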

A search for anything below /foo/bar could then be a simple term filter matching path:/foo/bar. That's a massively more performant filter, which can also be cached.
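Assuming a path field indexed with the path_hierarchy tokenizer as above, the filter could look like this. Note that in a filtered query the query and filter are siblings inside filtered, which is also the nesting the original query in the question is missing:

```json
POST localhost:9200/myindex/_search
{
    "query": {
        "filtered": {
            "query": { "match_all": {} },
            "filter": {
                "term": { "path": "/foo/bar" }
            }
        }
    }
}
```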


4 Comments

A simpler option is to map this field as a multi field with a non-analyzed version, and run the regexp filter on the not-analyzed field. In general, the regexp filter makes more sense on a non analyzed field.
That'd still be a very expensive query to execute.
Thanks @AlexBrasetvik I'm having some difficulty POSTing a JSON version of the mapping/analyzer config to my index _settings endpoint. It can't find the analyzer I've declared. Sample JSON would be really helpful if you have it, thanks.
@AlexBrasetvik why would it still be expensive to execute regex on non_analyzed fields?
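For reference, the multi-field variant discussed in these comments might look like this in ES 1.x (hypothetical index and field names; as noted above, the regexp is still expensive even on the not_analyzed sub-field, since it must be checked against every term in the dictionary):

```json
PUT localhost:9200/myindex/_mapping/page
{
    "page": {
        "properties": {
            "url": {
                "type": "string",
                "fields": {
                    "raw": { "type": "string", "index": "not_analyzed" }
                }
            }
        }
    }
}
```

The regexp filter would then target url.raw, where the whole URL is stored as a single term:

```json
"regexp": { "url.raw": ".*info-for/media.*" }
```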
