How to ignore trailing white-spaces while making an aggregation query in ElasticSearch

Question

I have an aggregation query that buckets documents by city name and, within each city, by country name. The query (which I run in Sense) is below:

GET test/_search
{
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "name.autocomplete" : {
            "query" : "new yo",
            "type" : "boolean"
          }
        }
      },
      "must_not" : {
        "term" : {
          "source" : "old"
        }
      }
    }
  },
  "aggregations" : {
    "city_name" : {
      "terms" : {
        "field" : "cityname.raw",
        "min_doc_count" : 1
      },
      "aggregations" : {
        "country_name" : {
          "terms" : {
            "field" : "countryname.raw"
          }
        }
      }
    }
  }
}

In the documents, New York occurs twice, once with an extra trailing space. The aggregation result I get is below:

{
  "key": "New York",
  "doc_count": 1,
  "city_name": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "United States of America",
        "doc_count": 1
      }
    ]
  }
},
{
  "key": "New York ",
  "doc_count": 1,
  "city_name": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "United States of America",
        "doc_count": 1
      }
    ]
  }
}

I need both New York values to be treated as the same key. Is there any way to write the query so that both end up in the same bucket? Anything that trims the trailing spaces would do, I guess, but I could not find anything. Thanks.

Is there a way for you to clean this up (i.e. trim) before sending the document to Elasticsearch?
– Val, Commented Oct 1, 2015 at 6:44
I know, cleaning can make it right. But I need a solution while querying!!
– Nihal Sharma, Commented Oct 1, 2015 at 6:45

Val · Accepted Answer · 2015-10-01 07:22:51Z

The ideal case is to clean up your fields before indexing your documents. If that's not an option, you can still clean them after the fact using (e.g.) the update-by-query plugin...
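As a rough illustration of cleaning before indexing, the sketch below strips surrounding whitespace from selected string fields of a document. The helper name `trim_fields` and the sample document are hypothetical; only the field names come from the question, and the actual indexing call is left out.

```python
def trim_fields(doc, fields=("cityname", "countryname")):
    """Return a copy of doc with surrounding whitespace stripped
    from the given string fields (hypothetical helper)."""
    cleaned = dict(doc)  # shallow copy so the caller's dict is untouched
    for field in fields:
        value = cleaned.get(field)
        if isinstance(value, str):
            cleaned[field] = value.strip()
    return cleaned


# Sample document, made up for illustration.
doc = {
    "cityname": "New York ",
    "countryname": "United States of America",
    "source": "new",
}
cleaned = trim_fields(doc)
print(cleaned["cityname"] == "New York")
```

The cleaned dictionary would then be sent to Elasticsearch in place of the original one.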

Or, at some cost in performance, use a terms aggregation with a script instead of a field, like this:

...
  "aggregations" : {
    "city_name" : {
      "terms" : {
        "script" : "doc['cityname.raw'].value.trim()",
        "min_doc_count" : 1
      },
      "aggregations" : {
        "country_name" : {
          "terms" : {
            "script" : "doc['countryname.raw'].value.trim()"
          }
        }
      }
    }
  }
}
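To see why the script merges the two buckets, here is a small local simulation in Python. It is not Elasticsearch code: `terms_buckets` is a hypothetical stand-in that only counts values per key, and Python's `str.strip()` is assumed to be close enough to the script's `trim()` for plain spaces.

```python
from collections import Counter


def terms_buckets(values, trim=False):
    """Count occurrences per key, loosely mimicking a terms aggregation.
    With trim=True each value is stripped first, like the script above."""
    keys = (v.strip() if trim else v for v in values)
    return dict(Counter(keys))


# The two city names from the question, one with a trailing space.
citynames = ["New York", "New York "]

raw_buckets = terms_buckets(citynames)
trimmed_buckets = terms_buckets(citynames, trim=True)

print(raw_buckets)
print(trimmed_buckets)
```

Without trimming, the two values form separate keys with one document each; with trimming, they collapse into a single key with a count of two.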

Yet another solution would be to change the fields from not_analyzed to analyzed strings, using a custom analyzer that keeps the whole value as a single token (as not_analyzed does) by combining the keyword tokenizer with a trim token filter.

{
  "settings": {
    "analysis": {
      "analyzer": {
        "trimmer": {
          "type": "custom",
          "filter": [ "trim" ],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "cityname": {
          "type": "string",
          "analyzer": "trimmer"
        },
        "countryname": {
          "type": "string",
          "analyzer": "trimmer"
        }
      }
    }
  }
}

If you index cityname: "New York City ", the token that is going to be stored will be trimmed to "New York City".
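The effect of that analyzer can be pictured with a tiny Python emulation. This is only a sketch of the idea, not how Elasticsearch implements it: `trimmer_analyze` is a hypothetical function, and `str.strip()` stands in for the trim token filter.

```python
def trimmer_analyze(text):
    """Emulate the 'trimmer' analyzer: the keyword tokenizer emits the
    whole input as one token, then the trim filter strips whitespace."""
    tokens = [text]                      # keyword tokenizer: single token
    return [t.strip() for t in tokens]   # trim filter on each token


tokens = trimmer_analyze("New York City ")
print(tokens)
```

Because both "New York" and "New York " would yield the same single token, a terms aggregation on the analyzed field would place them in one bucket.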

edited Oct 1, 2015 at 7:22

answered Oct 1, 2015 at 6:48

Val

1 Comment

Nihal Sharma Over a year ago

the script thing worked perfectly fine. Thanks a lot. And I am going to use the Trim token filter too, that should work.
