Is it possible to get an array of Elasticsearch document ids for each bucket when grouping with a terms aggregation? For example:

Current output

"aggregations": {
        "types": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "Text Document",
                    "doc_count": 3310
                },
                {
                    "key": "Unknown",
                    "doc_count": 15
                },
                {
                    "key": "Document",
                    "doc_count": 13
                }
            ]
        }
    }

Desired output

"aggregations": {
        "types": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "Text Document",
                    "doc_count": 3310,
                    "ids":["doc1","doc2", "doc3"....]
                },
                {
                    "key": "Unknown",
                    "doc_count": 15,
                    "ids":["doc11","doc12", "doc13"....]
                },
                {
                    "key": "Document",
                    "doc_count": 13,
                    "ids":["doc21","doc22", "doc23"....]
                }
            ]
        }
    }

I am not sure whether this is possible in Elasticsearch. Below is my aggregation query:

{
    "size": 0,
    "aggs": {
        "types": {
            "terms": {
                "field": "docType",
                "size": 10
            }
        }
    }
}

Elasticsearch version: 6.3.2

3 Answers

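You can use a top_hits sub-aggregation, which returns the top matching documents of each bucket (up to its "size", so not necessarily all of them). Every hit carries the document's _id, and with source filtering you can select which fields appear under hits. In the query below "_source" is limited to a field called ids; set it to false if you do not need any source fields at all.

Query:

{
  "size": 0,
  "aggs": {
    "district": {
      "terms": {
        "field": "docType",
        "size": 10
      },
      "aggs": {
        "docs": {
          "top_hits": {
            "size": 10,
            "_source": ["ids"]
          }
        }
      }
    }
  }
}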

2 Comments

Thanks, it is working. Is it possible to hide the metadata of each document, i.e. "_index", "_type", "_id", "_score", "_source"? I just need the _id of each doc, not all the other metadata.
@RaghuChahar Unfortunately no, the metadata cannot be removed.
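
If only the ids are needed, the extra metadata can still be dropped on the client once the response has arrived. Here is a minimal sketch in Python, assuming the response has already been parsed into a dict and the aggregations are named as in the query above. The helper name ids_per_bucket, the index name "files" and the sample response are made up for illustration.

```python
def ids_per_bucket(response, terms_agg, top_hits_agg):
    """Map each bucket key to the _id values of the hits in that bucket."""
    buckets = response["aggregations"][terms_agg]["buckets"]
    return {
        bucket["key"]: [hit["_id"] for hit in bucket[top_hits_agg]["hits"]["hits"]]
        for bucket in buckets
    }


# Hypothetical, trimmed-down response of a terms + top_hits query.
sample = {
    "aggregations": {
        "district": {
            "buckets": [
                {
                    "key": "Unknown",
                    "doc_count": 2,
                    "docs": {
                        "hits": {
                            "hits": [
                                {"_index": "files", "_id": "doc11", "_score": 1.0, "_source": {}},
                                {"_index": "files", "_id": "doc12", "_score": 1.0, "_source": {}},
                            ]
                        }
                    },
                }
            ]
        }
    }
}

result = ids_per_bucket(sample, "district", "docs")
```

Only as many ids come back as the top_hits "size" allows, so this does not list every document of a large bucket.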

For anyone interested, another solution is to build a custom bucket key with a script: a string of delimited values from the doc, including the id. It may not be pretty, but you can parse it out later, and if you just need something minimal like the doc id, it may be worth it. Keep in mind that every document then ends up in its own bucket, so "size" limits the number of documents returned rather than the number of types.

{
    "size": 0,
    "aggs": {
        "types": {
            "terms": {
                "script": "doc['docType'].value+'::'+doc['_id'].value",
                "size": 10
            }
        }
    }
}
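
Parsing the keys afterwards could look roughly like the following Python sketch. It assumes the buckets were produced by the script above with '::' as the separator and that no document id contains that separator. The function name group_ids and the sample buckets are hypothetical.

```python
from collections import defaultdict


def group_ids(buckets, separator="::"):
    """Turn buckets keyed '<docType>::<id>' into {docType: [id, ...]}."""
    grouped = defaultdict(list)
    for bucket in buckets:
        # Split at the last separator, so a docType that itself contains
        # the separator is still kept in one piece.
        doc_type, _, doc_id = bucket["key"].rpartition(separator)
        grouped[doc_type].append(doc_id)
    return dict(grouped)


# Made-up buckets in the shape the terms aggregation would return.
sample_buckets = [
    {"key": "Text Document::doc1", "doc_count": 1},
    {"key": "Text Document::doc2", "doc_count": 1},
    {"key": "Unknown::doc11", "doc_count": 1},
]

grouped_ids = group_ids(sample_buckets)
```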

I suppose the easiest way is to use a scripted_metric aggregation, though it may seem a bit complicated at first:

"aggs": {
  "types": {
    "terms": {
      "field": "docType",
      "size": 10
    },
    "aggs": {
      "ids": {
        "scripted_metric": {
          "init_script": "state.ids = []",
          "map_script": "state.ids.add(doc['_id'].value)",
          "combine_script": "state",
          "reduce_script": "def result = []; for (state in states) result.addAll(state.ids); return result;"
        }
      }
    }
  }
}

This aggregation should produce what you are looking for (the result of a scripted_metric is wrapped in a "value" object):

"aggregations" : {
  "types" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key": "Text Document",
        "doc_count": 3310,
        "ids": { "value": ["doc1", "doc2", "doc3", ...] }
      },
      {
        "key": "Unknown",
        "doc_count": 15,
        "ids": { "value": ["doc11", "doc12", "doc13"...] }
      },
      {
        "key": "Document",
        "doc_count": 13,
        "ids": { "value": ["doc21", "doc22", "doc23"...] }
      }
    ]
  }
}

Those four scripts are executed in the following order:

1) init_script
You are given a state object on which you can create any properties you want to use later.
This script is optional.
2) map_script
You have access to the previously initialized state object and to the doc object, which references the current document.
This script is executed once for each document in the current bucket, always with the same state object, so it needs some logic that collects the result data.
3) combine_script
As the documents may be spread across multiple shards (computers/processes), this script lets you aggregate the data collected from all documents on the current shard before it is passed on to the aggregation across all shards.
It is executed once the map_script has run for every document on the current shard.
In this case, the ids were already collected into the state object by the previous script, so we can return that object right away. Usually, though, this step is used when you want to calculate e.g. the min or max value of some field: you only store the field values in the previous script and do all the calculations here.
4) reduce_script
Finally, this script is executed after all shards have returned their data; its only job is to combine that data in some way and return the result.
You are given a states object that contains the results of the combine_script from every shard.
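
To make the order of the four phases easier to follow, here is a rough imitation of the flow in plain Python. It is not Elasticsearch code, only a sketch under the assumption that each shard gets its own state object. The names run_scripted_metric and shards, and the sample documents, are invented for the illustration.

```python
def init_script():
    # Creates the per-shard state, like "state.ids = []".
    return {"ids": []}


def map_script(state, doc):
    # Called once per document; collects the id into the shared state.
    state["ids"].append(doc["_id"])


def combine_script(state):
    # Nothing left to compute per shard, so the state is returned as is.
    return state


def reduce_script(states):
    # Merges the per-shard results into one flat list.
    result = []
    for state in states:
        result.extend(state["ids"])
    return result


def run_scripted_metric(shards):
    """Run the four phases over a list of shards (each a list of docs)."""
    states = []
    for docs in shards:
        state = init_script()
        for doc in docs:
            map_script(state, doc)
        states.append(combine_script(state))
    return reduce_script(states)


# Two made-up shards holding the documents of one bucket.
shards = [[{"_id": "doc1"}, {"_id": "doc2"}], [{"_id": "doc3"}]]
collected = run_scripted_metric(shards)
```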

I hope this helps and that it is not too late to post it. I was struggling with a similar task as well, and it is interesting that there is still no clear answer to it anywhere.

The official documentation on the scripted_metric aggregation is worth reading if anyone wants to learn a bit more about how it works.
