I'm trying to get the distinct values of a field and their counts in Elasticsearch.

This can be done via:

"distinct_publisher": {
        "terms": {
            "field": "publisher", "size": 0
        }
    }

The problem is that the aggregation counts individual terms. If a publisher value contains a space, e.g. "Chicken Dog", and 5 documents have this value in the publisher field, then I get 5 for chicken and 5 for dog:

"buckets" : [
            {
                "key" : "chicken",
                "doc_count" : 5
            },
            {
                "key" : "dog",
                "doc_count" : 5
            },
            ...
        ]

But the result I want is:

"buckets" : [
            {
                "key" : "Chicken Dog",
                "doc_count" : 5
            }
        ]

1 Answer

The reason you're getting a count of 5 for each of chicken and dog is that your documents were analyzed when you indexed them.

This means Elasticsearch did some light processing to turn Chicken Dog into the two tokens chicken and dog (lowercasing, then tokenizing on whitespace). You can see how Elasticsearch will analyze a given piece of text into searchable tokens by using the Analyze API, for example:

curl -XGET 'localhost:9200/_analyze?text=Chicken+Dog'
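
To make the effect on the bucket counts concrete, here is a small Python sketch that imitates it without a running cluster. The analyze function is only a rough stand-in for the default analyzer (lowercase, split into word tokens), not Elasticsearch's actual implementation, and the five-document list is taken from the example in the question.

```python
import re
from collections import Counter


def analyze(text):
    """Rough stand-in for the default analyzer: lowercase, then split into word tokens."""
    return re.findall(r"\w+", text.lower())


# Five documents whose publisher field holds the same value, as in the question.
publishers = ["Chicken Dog"] * 5

# Analyzed field: each document counts once for every distinct token it contains.
analyzed_counts = Counter()
for value in publishers:
    analyzed_counts.update(set(analyze(value)))

# Unanalyzed field: the whole value is kept as a single term.
raw_counts = Counter(publishers)

print(sorted(analyzed_counts.items()))  # [('chicken', 5), ('dog', 5)]
print(sorted(raw_counts.items()))       # [('Chicken Dog', 5)]
```

The first line of output mirrors the buckets in the question; the second is the shape of result being asked for.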

To aggregate over the "raw" distinct values, you need to map the field as not_analyzed so that Elasticsearch skips its usual processing. This reference may help. You may need to reindex your data for the not_analyzed mapping to take effect and give the result you want.
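
As a sketch of what such a mapping could look like, the Python snippet below only builds and prints the JSON request bodies; it does not talk to a cluster. The mapping type name "book" is hypothetical, and the string type with "index": "not_analyzed" assumes one of the older Elasticsearch versions that the "size": 0 terms syntax in the question suggests (later versions use a keyword type instead).

```python
import json

MAPPING_TYPE = "book"  # hypothetical mapping type name, not taken from the question

# Body for creating the index: publisher is stored as one unanalyzed term.
mapping_body = {
    "mappings": {
        MAPPING_TYPE: {
            "properties": {
                "publisher": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
}

# Body for the search request: the terms aggregation from the question.
# The top-level "size": 0 suppresses the hits so only the buckets come back.
aggregation_body = {
    "size": 0,
    "aggs": {
        "distinct_publisher": {
            "terms": {"field": "publisher", "size": 0}
        }
    },
}

print(json.dumps(mapping_body, indent=2))
print(json.dumps(aggregation_body, indent=2))
```

The first body would be sent when creating the new index, and the second to the search endpoint once the data has been reindexed into it.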

1 Comment

Thanks a lot! This was absolutely what I was looking for and also a detailed and very good answer.