I have an aggregation query that buckets documents by city name and, within each city, by country name. The query (which I run in Sense) is as follows:

GET test/_search
{
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "name.autocomplete" : {
            "query" : "new yo",
            "type" : "boolean"
          }
        }
      },
      "must_not" : {
        "term" : {
          "source" : "old"
        }
      }
    }
  },
  "aggregations" : {
    "city_name" : {
      "terms" : {
        "field" : "cityname.raw",
        "min_doc_count" : 1
      },
      "aggregations" : {
        "country_name" : {
          "terms" : {
            "field" : "countryname.raw"
          }
        }
      }
    }
  }
}

Now, in the documents New York occurs twice, once with an extra trailing space. The aggregation result I get is as below:

{
     "key": "New York",
     "doc_count": 1,
     "city_name": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
           {
              "key": "United States of America",
              "doc_count": 1
           }
        ]
     }
  },
  {
     "key": "New York ",
     "doc_count": 1,
     "city_name": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
           {
              "key": "United States of America",
              "doc_count": 1
           }
        ]
     }
  }

I need both New York values to be treated as the same key. Is there any way to write the query so that both end up in the same bucket? Anything that trims the trailing space would do, I guess, but I could not find anything. Thanks.

  • Is there a way for you to clean this up (i.e. trim) before sending the document to Elasticsearch? Commented Oct 1, 2015 at 6:44
  • I know, cleaning can make it right. But I need a solution while querying!! Commented Oct 1, 2015 at 6:45

1 Answer

The ideal case is to clean up your fields before indexing your documents. If that's not an option, you can still clean them after the fact using (e.g.) the update-by-query plugin...

Or, but that's a bit worse performance-wise, use a terms aggregation with a script instead of a field, like this:

...
  "aggregations" : {
    "city_name" : {
      "terms" : {
        "script" : "doc['cityname.raw'].value.trim()",
        "min_doc_count" : 1
      },
      "aggregations" : {
        "country_name" : {
          "terms" : {
            "script" : "doc['countryname.raw'].value.trim()"
          }
        }
      }
    }
  }
}
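
To make the effect of the script concrete, here is a small Python simulation of what trimming the bucket keys does. It is only a sketch, not Elasticsearch code: the sample documents and the `nested_terms` helper are hypothetical, and Python's `strip()` stands in for the `trim()` call in the script.

```python
from collections import defaultdict

# Hypothetical sample documents; the field names mirror the question.
docs = [
    {"cityname": "New York", "countryname": "United States of America"},
    {"cityname": "New York ", "countryname": "United States of America"},
]

def nested_terms(documents, trim):
    """Count documents per city and, within each city, per country.

    When trim is True, keys are stripped first, which is the role the
    script plays in the terms aggregation.
    """
    buckets = defaultdict(lambda: defaultdict(int))
    for doc in documents:
        city = doc["cityname"].strip() if trim else doc["cityname"]
        country = doc["countryname"].strip() if trim else doc["countryname"]
        buckets[city][country] += 1
    # Convert to plain dicts so the result is easy to inspect.
    return {city: dict(countries) for city, countries in buckets.items()}

raw_buckets = nested_terms(docs, trim=False)      # two separate city keys
trimmed_buckets = nested_terms(docs, trim=True)   # one merged city key
print(raw_buckets)
print(trimmed_buckets)
```

Without trimming, the two spellings land in separate buckets, as in the question. With trimming, they collapse into a single "New York" bucket whose count is the sum of both.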

Yet another solution would be to change from not_analyzed to an analyzed string, but with a custom analyzer that preserves the whole value as a single token (as not_analyzed does) by combining the keyword tokenizer with a trim token filter.

{
  "settings": {
    "analysis": {
      "analyzer": {
        "trimmer": {
          "type": "custom",
          "filter": [ "trim" ],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "cityname": {
          "type": "string",
          "analyzer": "trimmer"
        },
        "countryname": {
          "type": "string",
          "analyzer": "trimmer"
        }
      }
    }
  }
}

If you index cityname: "New York City ", the token that is going to be stored will be trimmed to "New York City".
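
As a rough illustration of why the keyword tokenizer matters here, the Python sketch below mimics the two steps of the custom analyzer and contrasts them with a word-splitting analyzer. Both functions are hypothetical stand-ins rather than Elasticsearch APIs, and the word splitter is deliberately simplified (a real analyzer may also lowercase, for example).

```python
def keyword_trim_analyze(text):
    """Stand-in for the keyword tokenizer followed by the trim filter."""
    tokens = [text]                      # keyword tokenizer: whole input is one token
    return [t.strip() for t in tokens]   # trim filter: drop leading/trailing whitespace

def word_split_analyze(text):
    """Simplified stand-in for an analyzer that splits on whitespace."""
    return text.split()

city = "New York City "
print(keyword_trim_analyze(city))  # a single token, usable as a bucket key
print(word_split_analyze(city))    # several tokens, so buckets would be per word
```

The single trimmed token is what lets the terms aggregation keep whole city names as bucket keys, whereas a word-splitting analyzer would produce one bucket per word.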


1 Comment

the script thing worked perfectly fine. Thanks a lot. And I am going to use the Trim token filter too, that should work.
