
I have some code to query specific strings in a field message as below:

"message": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"

Here is my code:

from elasticsearch import Elasticsearch
import json

client = Elasticsearch(['http://192.168.1.114:9200'])

response = client.search(
  index="squidlog-2017.10.29",
  body={
      "query": {
          "match": {
            "message": 'GET'
          }
      }
  }
)

for hit in response['hits']['hits']:
    print(json.dumps(hit['_source'], indent=4, sort_keys=True))

When I query for a specific string like GET with the template above, everything is OK. But when I query for part of the URL in the message, I don't get anything back. For example, the following returns nothing:

body={
      "query": {
          "match": {
            "message": 'pravda'
          }
      }
  }

Is there a problem with the slashes in my message when I query? Any advice would be appreciated. Thanks.

1 Answer

You might consider using a different tokenizer, which will make the desired search possible. But first, let me explain why your query returns no results in the second case.

standard analyzer and tokenizer

By default, the standard analyzer uses the standard tokenizer, which keeps the domain name as a single token instead of splitting it on dots. You can try different analyzers and tokenizers with the _analyze endpoint, like this:

GET _analyze
{
    "text": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
}

The response is the list of tokens that Elasticsearch will use to represent this string when searching. Here it is:

{
   "tokens": [
      {
         "token": "oct",
         "start_offset": 0,
         "end_offset": 3,
         "type": "<ALPHANUM>",
         "position": 0
      }, ...
      {
         "token": "http",
         "start_offset": 59,
         "end_offset": 63,
         "type": "<ALPHANUM>",
         "position": 11
      },
      {
         "token": "www.pravda.ru",
         "start_offset": 66,
         "end_offset": 79,
         "type": "<ALPHANUM>",
         "position": 12
      },
      {
         "token": "science",
         "start_offset": 80,
         "end_offset": 87,
         "type": "<ALPHANUM>",
         "position": 13
      }, ...
   ]
}

As you can see, "pravda" is not in the list of tokens, hence you cannot search for it. You can only search for the tokens that your analyzer emits.

Note that "pravda" is part of the domain name, which is analyzed as a single token: "www.pravda.ru".
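To see why the match misses, you can simulate the inverted-index lookup in plain Python (a rough sketch, not how Elasticsearch works internally): a match query on an analyzed field only finds documents whose token list contains the search term exactly, never substrings of a token.

```python
# Tokens the standard tokenizer emits for the log line
# (abridged from the _analyze output above).
tokens = {"oct", "http", "www.pravda.ru", "science", "get"}

# A match query compares the term against whole tokens, not substrings:
print("pravda" in tokens)         # no hit: "pravda" is only a substring of a token
print("www.pravda.ru" in tokens)  # hit: the full domain is a token
```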

lowercase tokenizer

If you use a different tokenizer, for instance the lowercase tokenizer, it will emit pravda as a token and it will be possible to search for it:

GET _analyze
{
    "tokenizer" : "lowercase",
    "text": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
}

And the list of tokens:

{
   "tokens": [
      {
         "token": "oct",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 0
      }, ...
      {
         "token": "http",
         "start_offset": 59,
         "end_offset": 63,
         "type": "word",
         "position": 4
      },
      {
         "token": "www",
         "start_offset": 66,
         "end_offset": 69,
         "type": "word",
         "position": 5
      },
      {
         "token": "pravda",
         "start_offset": 70,
         "end_offset": 76,
         "type": "word",
         "position": 6
      },
      {
         "token": "ru",
         "start_offset": 77,
         "end_offset": 79,
         "type": "word",
         "position": 7
      },
      {
         "token": "science",
         "start_offset": 80,
         "end_offset": 87,
         "type": "word",
         "position": 8
      }, ...
   ]
}
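As a rough sketch of what the lowercase tokenizer does (it divides the text at every non-letter character and lowercases each term), you can reproduce the token list above in Python; the regex here is an approximation for ASCII input:

```python
import re

def lowercase_tokenize(text):
    # Split at every non-letter character and lowercase each term,
    # mimicking Elasticsearch's "lowercase" tokenizer for ASCII text.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

msg = ("Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET "
       "http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html")

tokens = lowercase_tokenize(msg)
print("pravda" in tokens)  # True: the dots split the domain, so "pravda" is its own token
```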

How to define analyzer before indexing?

To be able to search for such tokens, you have to analyze the field differently at index time. That means defining a mapping with a different analyzer, as in this example:

PUT yet_another_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_custom_analyzer": {
               "type": "custom",
               "tokenizer": "lowercase"
            }
         }
      }
   },
   "mappings": {
      "my_type": {
         "properties": {
            "message": {
               "type": "text",
               "fields": {
                  "lowercased": {
                     "type": "text",
                     "analyzer": "my_custom_analyzer"
                  }
               }
            }
         }
      }
   }
}

Here, we first define a custom analyzer with the desired tokenizer, and then tell Elasticsearch to index our message field twice via the fields feature: implicitly with the default analyzer, and explicitly with my_custom_analyzer.
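For completeness, the same index could be created from the Python client used in the question. This is a sketch: the create call is commented out because it needs a running cluster, and the index and type names are simply the ones from the example above.

```python
# Settings and mappings from the example above, as a plain Python dict.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "message": {
                    "type": "text",
                    "fields": {
                        "lowercased": {
                            "type": "text",
                            "analyzer": "my_custom_analyzer",
                        }
                    },
                }
            }
        }
    },
}

# With the elasticsearch-py client from the question:
# client.indices.create(index="yet_another_index", body=index_body)
```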

Now we are able to query for the desired token. A request against the original field returns no hits:

POST yet_another_index/my_type/_search
{
    "query": {
        "match": {
            "message": "pravda"
        }
    }
}

   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }

But the query to the message.lowercased will succeed:

POST yet_another_index/my_type/_search
{
    "query": {
        "match": {
            "message.lowercased": "pravda"
        }
    }
}

   "hits": {
      "total": 1,
      "max_score": 0.25316024,
      "hits": [
         {
            "_index": "yet_another_index",
            "_type": "my_type",
            "_id": "AV9u1qZmB9pi5Gaw0rj1",
            "_score": 0.25316024,
            "_source": {
               "message": "Oct 29 11:38:46 1893 192.168.1.114 TCP_MISS/200 153925 GET http://www.pravda.ru/science/ - DIRECT/185.103.135.90 text/html"
            }
         }
      ]
   }

There are plenty of options; this solution addresses the example you provided. Check out the different analyzers and tokenizers to find the one that suits you best.

Hope that helps!


1 Comment

Excellent! Your explanation helped me understand this better, and now I can solve my problem. Thanks very much.
