
I'm trying to rebuild my Elasticsearch query, because I found that I'm not receiving all the documents I'm looking for.

So, let's assume that I have a document like this:

{
  "id": 1234,
  "mail_id": 5,
  "sender": "john smith",
  "email": "[email protected]",
  "subject": "somesubject",
  "txt": "abcdefgh\r\n",
  "html": "<div dir=\"ltr\">abcdefgh</div>\r\n",
  "date": "2017-07-020 10:00:00"
}

I have a few million documents like this, and now I'm trying to search for some of them with a query like this:

{
  "sort": [
    {
      "date": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "bool": {
      "minimum_should_match": "100%",
      "should": [
        {
          "multi_match": {
            "type": "cross_fields",
            "query": "abcdefgh johnsmith john smith",
            "operator": "and",
            "fields": [
              "email.full",
              "sender",
              "subject",
              "txt",
              "html"
            ]
          }
        }
      ],
      "must": [
        {
          "ids": {
            "values": [
              "1234"
            ]
          }
        },
        {
          "term": {
            "mail_id": 5
          }
        }
      ]
    }
  }
}

For a query like this everything is fine, but when I want to find the document with the query 'gmail' or 'com', it does not work:

"query": "abcdefgh johnsmith john smith gmail"
"query": "abcdefgh johnsmith john smith com"

It only works when I search for 'gmail.com': "query": "abcdefgh johnsmith john smith gmail.com"

So... I tried attaching an analyzer:

...
"type": "cross_fields",
"query": "abcdefgh johnsmith john smith",
"operator": "and",
"analyzer": "simple",
...

It does not help at all. The only way I was able to find this document was to define a regexp, e.g.:

"minimum_should_match": 1,
"should": [
  {
    "multi_match": {
      "type": "cross_fields",
      "query": "fdsfs wukamil kam wuj gmail.com",
      "operator": "and",
      "fields": [
        "email.full",
        "sender",
        "subject",
        "txt",
        "html"
      ]
    }
  },
  {
    "regexp": {
      "email.full": ".*gmail.*"
    }
  }
],

but with this approach I would have to add (queries × fields) regexp objects to my JSON, so I don't think it would be the best solution. I also know about wildcard queries, but they would be just as messy as the regexps.

If anyone has had a problem like this and knows the solution, I will be thankful for help :)

  • Can you show what your mapping for the email field is inside ES? Commented Jul 20, 2017 at 10:28
  • email.full is just {"type" : "text"} Commented Jul 20, 2017 at 11:00

1 Answer

If you run your search term through the standard analyser, you can see what tokens johnsmith@gmail.com gets broken down into. You can do this directly in your browser using the URL below:

https://<your_site>:<es_port>/_analyze/?analyzer=standard&text=johnsmith@gmail.com

This will show that the email gets broken down into the following tokens:

{
  "tokens": [
    {
      "token": "johnsmith",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "gmail.com",
      "start_offset": 10,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
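(As a side note: on more recent Elasticsearch versions the URL-parameter form of `_analyze` is deprecated, and the same check is done with a request body. The endpoint below is the standard `_analyze` API; the exact version where the switch happened depends on your cluster:)

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "johnsmith@gmail.com"
}
```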

So this shows that you can't search using just 'gmail', but you can using 'gmail.com'. To split your text on the dot too, you can update your mapping to use the Simple Analyzer, whose documentation says:

The simple analyzer breaks text into terms whenever it encounters a character which is not a letter. All terms are lower cased.

We can show that this works by updating our URL from earlier to use the simple analyser, as below:

https://<your_site>:<es_port>/_analyze/?analyzer=simple&text=johnsmith@gmail.com

Which returns:

{
  "tokens": [
    {
      "token": "johnsmith",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "gmail",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "com",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}
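To apply this at index time you change the mapping rather than the query. As a sketch (the sub-field name `parts` is an assumption, the mapping is written in the typeless 7.x+ form, and existing documents must be reindexed before the new analyser takes effect), the email field could get an extra sub-field that uses the simple analyser:

```json
{
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "fields": {
          "full":  { "type": "text" },
          "parts": { "type": "text", "analyzer": "simple" }
        }
      }
    }
  }
}
```

A multi_match targeting `email.parts` instead of (or in addition to) `email.full` would then match 'gmail' and 'com' as separate terms, with no regexp objects needed.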

This analyser may not be the right tool for the job, as it ignores any non-letter values, but you can play with analysers and tokenisers until you get what you need.
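For example, if dropping digits is a problem, a custom analyser built on the pattern tokeniser can split on runs of any non-alphanumeric character instead. A sketch of such index settings (the names `alnum` and `alnum_splitter` are made up for illustration):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "alnum_splitter": {
          "type": "pattern",
          "pattern": "[^a-zA-Z0-9]+"
        }
      },
      "analyzer": {
        "alnum": {
          "type": "custom",
          "tokenizer": "alnum_splitter",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

The pattern tokeniser splits wherever the regex matches, so johnsmith@gmail.com becomes johnsmith / gmail / com, while an address like user123@mail.com keeps its digits intact.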
