
I'm trying to rebuild my Elasticsearch query, because I found that I'm not receiving all the documents I'm looking for.

So, let's assume that I have a document like this:

{
  "id": 1234,
  "mail_id": 5,
  "sender": "john smith",
  "email": "[email protected]",
  "subject": "somesubject",
  "txt": "abcdefgh\r\n",
  "html": "<div dir=\"ltr\">abcdefgh</div>\r\n",
  "date": "2017-07-020 10:00:00"
}

I have a few million documents like this, and now I'm trying to search for some of them with a query like this:

{
  "sort": [
    {
      "date": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "bool": {
      "minimum_should_match": "100%",
      "should": [
        {
          "multi_match": {
            "type": "cross_fields",
            "query": "abcdefgh johnsmith john smith",
            "operator": "and",
            "fields": [
              "email.full",
              "sender",
              "subject",
              "txt",
              "html"
            ]
          }
        }
      ],
      "must": [
        {
          "ids": {
            "values": [
              "1234"
            ]
          }
        },
        {
          "term": {
            "mail_id": 5
          }
        }
      ]
    }
  }
}

For a query like this everything is fine, but when I want to find the document with the query 'gmail' or 'com', it does not work:

"query": "abcdefgh johnsmith john smith gmail"
"query": "abcdefgh johnsmith john smith com"

It only works when I search for 'gmail.com': "query": "abcdefgh johnsmith john smith gmail.com"

So... I tried attaching an analyzer:

...
"type": "cross_fields",
"query": "abcdefgh johnsmith john smith",
"operator": "and",
"analyzer": "simple",
...

It does not help at all. The only way I was able to find this document was to define a regexp, e.g.:

"minimum_should_match": 1,
"should": [
  {
    "multi_match": {
      "type": "cross_fields",
      "query": "fdsfs wukamil kam wuj gmail.com",
      "operator": "and",
      "fields": [
        "email.full",
        "sender",
        "subject",
        "txt",
        "html"
      ]
    }
  },
  {
    "regexp": {
      "email.full": ".*gmail.*"
    }
  }
],

but with this approach I would have to add (queries × fields) regexp objects to my JSON, so I don't think it would be the best solution. I also know about wildcard queries, but they would be just as messy as the regexps.

If anyone has had a problem like this and knows the solution, I will be thankful for help :)

  • Can you show what your mapping for the email field is inside ES? Commented Jul 20, 2017 at 10:28
  • email.full is just {"type" : "text"} Commented Jul 20, 2017 at 11:00

1 Answer

If you run your search term through the standard analyser, you can see what tokens johnsmith@gmail.com gets broken down into. You can do this directly in your browser using the URL below:

https://<your_site>:<es_port>/_analyze/?analyzer=standard&text=johnsmith@gmail.com

This will show that the email gets broken down into the following tokens:

{
  "tokens": [
    {
      "token": "johnsmith",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "gmail.com",
      "start_offset": 10,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
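(As a side note: on more recent Elasticsearch versions the URL-parameter form of `_analyze` is deprecated, and the same check is done with a request body. The endpoint below is the standard `_analyze` API; the exact version where the switch happened depends on your cluster:)

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "johnsmith@gmail.com"
}
```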

So this shows that you can't search using just 'gmail', but you can using 'gmail.com'. To split your text on the dot too, you can update your mapping to use the Simple Analyzer, whose documentation says:

The simple analyzer breaks text into terms whenever it encounters a character which is not a letter. All terms are lower cased.

We can show that this works by updating our URL from earlier to use the simple analyser, as below:

https://<your_site>:<es_port>/_analyze/?analyzer=simple&text=johnsmith@gmail.com

Which returns:

{
  "tokens": [
    {
      "token": "johnsmith",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "gmail",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "com",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}
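To apply this at index time you change the mapping rather than the query. As a sketch (the sub-field name `parts` is an assumption, the mapping is written in the typeless 7.x+ form, and existing documents must be reindexed before the new analyser takes effect), the email field could get an extra sub-field that uses the simple analyser:

```json
{
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "fields": {
          "full":  { "type": "text" },
          "parts": { "type": "text", "analyzer": "simple" }
        }
      }
    }
  }
}
```

A multi_match targeting `email.parts` instead of (or in addition to) `email.full` would then match 'gmail' and 'com' as separate terms, with no regexp objects needed.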

This analyser may not be the right tool for the job, as it ignores any non-letter values, but you can play with analysers and tokenisers until you get what you need.
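For example, if dropping digits is a problem, a custom analyser built on the pattern tokeniser can split on runs of any non-alphanumeric character instead. A sketch of such index settings (the names `alnum` and `alnum_splitter` are made up for illustration):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "alnum_splitter": {
          "type": "pattern",
          "pattern": "[^a-zA-Z0-9]+"
        }
      },
      "analyzer": {
        "alnum": {
          "type": "custom",
          "tokenizer": "alnum_splitter",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

The pattern tokeniser splits wherever the regex matches, so johnsmith@gmail.com becomes johnsmith / gmail / com, while an address like user123@mail.com keeps its digits intact.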
