Elasticsearch 7.8 Nested Aggregation not returning correct data

Question

I have been struggling for a week to get correct data out of an Elasticsearch nested aggregation. Below are my index mapping and two sample documents. What I want to find is:

Match all documents where the field xforms.sentence.tokens.value equals 24
Within the matched set of documents, count the matches grouped by xforms.sentence.tokens.tag, considering only tokens where xforms.sentence.tokens.value equals 24

As an example, for the documents inserted below, the output I expect is:

{"JJ": 1, "NN": 1}

{
  "_doc": {
    "_meta": {},
    "_source": {},
    "properties": {
      "originalText": {
        "type": "text"
      },
      "testDataId": {
        "type": "text"
      },
      "xforms": {
        "type": "nested",
        "properties": {
          "sentence": {
            "type": "nested"
          },
          "predicate": {
            "type": "nested"
          }
        }
      },
      "corpusId": {
        "type": "text"
      },
      "row": {
        "type": "text"
      },
      "batchId": {
        "type": "text"
      },
      "processor": {
        "type": "text"
      }
    }
  }
}

The two sample documents inserted are as follows:

{
    "_id": "28",
    "_source": {
        "testDataId": "5e97e9bef033448b893e485baa0fdf15",
        "originalText": "Some text with the word 24",
        "xforms": [{
            "sentence": {
                "tokens": [{
                        "lemma": "Some",
                        "index": 1,
                        "after": " ",
                        "tag": "JJ",
                        "value": "Some"
                    },
                    {
                        "lemma": "text",
                        "index": 2,
                        "after": " ",
                        "tag": "NN",
                        "value": "text"
                    },
                    {
                        "lemma": "with",
                        "index": 3,
                        "after": " ",
                        "tag": "NN",
                        "value": "with"
                    },
                    {
                        "lemma": "the",
                        "index": 4,
                        "after": "",
                        "tag": "CD",
                        "value": "the"
                    },
                    {
                        "lemma": "word",
                        "index": 5,
                        "after": " ",
                        "tag": "CC",
                        "value": "word"
                    },
                    {
                        "lemma": "24",
                        "index": 6,
                        "after": " ",
                        "tag": "JJ",
                        "value": "24"
                    }
                ],
                "type": "RAW"
            },
            "originalSentence": "Some text with the word 24 in it",
            "id": "e724611d8c024bcb8f0158b60e3df87e"
        }]
    }
},
{
    "_id": "56",
    "_source": {
        "testDataId": "5e97e9bef033448b893e485baa0fad15",
        "originalText": "24 word",
        "xforms": [{
            "sentence": {
                "tokens": [{
                        "lemma": "24",
                        "index": 1,
                        "after": " ",
                        "tag": "NN",
                        "value": "24"
                    },
                    {
                        "lemma": "word",
                        "index": 2,
                        "after": " ",
                        "tag": "JJ",
                        "value": "word"
                    }
                ],
                "type": "RAW"
            },
            "originalSentence": "24 word",
            "id": "e724611d8c024bcb8f0158b60e3d123"
        }]
    }
}

Jozef - Spatialized.io · Accepted Answer · 2020-08-16 15:24:24Z

Expanding on @Gibbs's answer: @N Kiran, you'll need to map tokens as nested too:

{
  "xforms":{
    "type":"nested",
    "properties":{
      "sentence":{
        "type":"nested",
        "properties":{
          "tokens":{              <----
            "type":"nested"
          }
        }
      },
      "predicate":{
        "type":"nested"
      }
    }
  }
}

Then and only then will your aggs yield the correct counts:

{
  "aggregations":{
    "xforms":{
      "doc_count":8,
      "inner":{
        "doc_count":2,
        "tag_count":{
          "doc_count_error_upper_bound":0,
          "sum_other_doc_count":0,
          "buckets":[
            {
              "key":"JJ",
              "doc_count":1
            },
            {
              "key":"NN",
              "doc_count":1
            }
          ]
        }
      }
    }
  }
}
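
The answer does not show the request that produced these counts. A minimal sketch, assuming it is the aggregation from Gibbs's answer in this thread with only the nested path moved to the new innermost nested level (xforms.sentence.tokens); that assumption is consistent with the outer doc_count of 8, one per token across the two sample documents. The comments follow the style used elsewhere in this thread; strip them if your client rejects comments in JSON.

```json
{
  "aggs": {
    "xforms": {
      "nested": { // Assumption: path now points at the tokens level, not xforms.sentence
        "path": "xforms.sentence.tokens"
      },
      "aggs": {
        "inner": { // Keeps only the token objects whose value is 24
          "filter": {
            "terms": {
              "xforms.sentence.tokens.value": ["24"]
            }
          },
          "aggs": {
            "tag_count": { // Buckets the surviving tokens by tag
              "terms": {
                "field": "xforms.sentence.tokens.tag.keyword"
              }
            }
          }
        }
      }
    }
  }
}
```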

Side note: you'll have to reindex in order for the changed mapping to apply.
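
The mapping of an existing field cannot be changed in place, so the usual route is to create a new index with the updated mapping and copy the documents into it with the _reindex API. A minimal sketch of the request body for POST _reindex; both index names are hypothetical placeholders, not names from the question:

```json
{
  "source": { // Hypothetical name of the existing index with the old mapping
    "index": "tokens_v1"
  },
  "dest": { // Hypothetical name of a new index created beforehand with tokens mapped as nested
    "index": "tokens_v2"
  }
}
```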

1 Comment

N Kiran Kumar Kowlgi Over a year ago

Thanks, that did the trick. The ES documentation on nested fields does seem a bit lacking, as the examples only ever cover a single level of nesting. A secondary question, if I may: on the same index, is it possible to perform an aggregation filter with wildcards, i.e. 24* instead of 24?
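
The thread does not answer this follow-up. One possible approach, offered only as a sketch, is to swap the terms filter for a wildcard query inside the same filter aggregation. It assumes tokens is already mapped as nested and that value is a dynamically mapped text field, so the pattern is matched against its analyzed terms:

```json
{
  "aggs": {
    "xforms": {
      "nested": {
        "path": "xforms.sentence.tokens"
      },
      "aggs": {
        "inner": { // Assumption: wildcard query replaces the exact terms filter
          "filter": {
            "wildcard": {
              "xforms.sentence.tokens.value": {
                "value": "24*"
              }
            }
          },
          "aggs": {
            "tag_count": {
              "terms": {
                "field": "xforms.sentence.tokens.tag.keyword"
              }
            }
          }
        }
      }
    }
  }
}
```

When the pattern only has a trailing wildcard, as here, a prefix query on the same field would be an alternative worth trying.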

Gibbs · Accepted Answer · 2020-08-16 03:13:38Z

{
  "aggs": {
    "xforms": {
      "nested": { //Nested aggregation
        "path": "xforms.sentence"
      },
      "aggs": {
        "inner": { //Counting only within the matching doc
          "filter": {
            "bool": {
              "filter": { //Filtering docs with value=24
                "terms": {
                  "xforms.sentence.tokens.value": [
                    "24"
                  ]
                }
              }
            }
          },
          "aggs": {
            "tag_count": { //On filtered doc, doing terms aggregation on tag's keyword version as tag is of type text
              "terms": {
                "field": "xforms.sentence.tokens.tag.keyword"
              }
            }
          }
        }
      }
    }
  }
}

It produces the following output:

"aggregations": {
        "xforms": {
            "doc_count": 2,
            "inner": {
                "doc_count": 2,
                "tag_count": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                        {
                            "key": "JJ",
                            "doc_count": 2
                        },
                        {
                            "key": "NN",
                            "doc_count": 2
                        },
                        {
                            "key": "CC",
                            "doc_count": 1
                        },
                        {
                            "key": "CD",
                            "doc_count": 1
                        }
                    ]
                }
            }
        }
    }

Yes, but that response is incorrect. Since "24" is tagged as "JJ" only once, the count should have been 1, not 2.
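
The inflated counts are consistent with how Elasticsearch treats an array of objects that is not mapped as nested: the link between each token's value and its tag is lost, and every field becomes a flat multi-value list on the enclosing sentence. A conceptual sketch of how the sentence in document 28 is effectively seen by the query (illustrative only, showing source values before analysis, not the actual on-disk format):

```json
{
  // Conceptual view: one flat list per field, with no record of which value belongs to which tag
  "xforms.sentence.tokens.value": ["Some", "text", "with", "the", "word", "24"],
  "xforms.sentence.tokens.tag": ["JJ", "NN", "NN", "CD", "CC", "JJ"]
}
```

Because the filter on value 24 matches the sentence as a whole, every tag in that sentence is counted: JJ and NN from both documents, plus CC and CD from document 28, which yields exactly the 2/2/1/1 buckets shown above.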
