0

I am new to Elastic Search and was exploring aggregation query. The documents I have are in the format -

{"name":"A",
     "class":"10th",
     "subjects":{
         "S1":92,
         "S2":92,
         "S3":92,
     }
}

We have about 40k such documents in our ES with the Subjects varying from student to student. The query to the system can be to aggregate all subject-wise scores for a given class. We tried to create a bucket aggregation query as explained in this guide here, however, this generates a single bucket per document and in our understanding requires an explicit mention of every subject.

We want to system to generate subject wise aggregate for the data by executing a single aggregation query, the problem I face is that in our data the subjects could vary from student to student and we don't have a global list of subject keys.

We wrote the following script but this only works if we know all possible subjects.

GET student_data_v1_1/_search

{ "query" :
    {"match" : 
         { "class" : "' + query + '" }}, 
         "aggs" : { "my_buckets" : { "terms" : 
         { "field" : "subjects", "size":10000 },
         "aggregations": {"the_avg": 
                      {"avg": { "field": "subjects.value" }}} }},
          "size" : 0 }'

but this query only works for the document structure, but does not work multiple subjects are defined where we may not know the key-pair -

{"name":"A",
     "class":"10th",
     "subjects":{
         "value":93
     }
}

An alternate form the document is present is that the subject is a list of dictionaries -

    {"name":"A",
     "class":"10th",
     "subjects":[
         {"S1":92},
         {"S2":92},
         {"S3":92},
     ]
}

Having an aggregation query to solve either of the 2 document formats would be helpful.
======EDITS======

After updating the document to hold weights for each subject -

        {
class": "10th",
   "subject": [
 {
   "name": "s1",
   "marks": 90,
   "weight":30
 },
 {
   "name": "s2",
  "marks": 80,
   "weight":70
 }
   ]}

I have updated the query to be -

{
  "query": {
    "match": {
      "class": "10th"
}
  },
  "aggs": {
    "subjects": {
      "nested": {
        "path": "scores"
      },    
      "aggs": {
        "subjects": {
          "terms": {
            "field": "subject.name"
      },
      "aggs" : { "weighted_grade": { "weighted_avg": { "value": { "field": "subjects.score" }, "weight": { "field": "subjects.weight" } } } }
      }
    }
  }
}
      },
      "size": 0
    }

but it throws the error-

{u'error': {u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'root_cause': [{u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'type': u'unknown_named_object_exception'}],
u'type': u'unknown_named_object_exception'},
u'status': 400}
2
  • Can you share the mapping of index in both the above cases. You are trying to apply aggregation on subject.value whereas value property doesn't exist under subject Commented Nov 26, 2018 at 16:46
  • @NishantSaini I have updated the question with the sample document with "value" property. It only works if we know all the possible key pairs or/and select only 1 key/document. Commented Nov 26, 2018 at 18:24

1 Answer 1

1

To achieve the required result I would suggest you to keep your index mapping as follows:

{
  "properties": {
    "class": {
      "type": "keyword"
    },
    "subject": {
      "type": "nested",
      "properties": {
        "marks": {
          "type": "integer"
        },
        "name": {
          "type": "keyword"
        }
      }
    }
  }
}

In the mapping above I have created subject as nested type with two properties, name to hold subject name and marks to hold marks in the subject.

Sample doc:

{
  "class": "10th",
  "subject": [
    {
      "name": "s1",
      "marks": 90
    },
    {
      "name": "s2",
      "marks": 80
    }
  ]
}

Now you can use nested aggregation and multilevel aggregation (i.e. aggregation inside aggregation). I used nested aggregation with terms aggregation for subject.name to get bucket containing all the available subjects. Then to get avg for each subject we add a child aggregation of avg to the subjects aggregation as below:

{
  "query": {
    "match": {
      "class": "10th"
    }
  },
  "aggs": {
    "subjects": {
      "nested": {
        "path": "subject"
      },
      "aggs": {
        "subjects": {
          "terms": {
            "field": "subject.name"
          },
          "aggs": {
            "avg_score": {
              "avg": {
                "field": "subject.marks"
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}

NOTE: I have added "size" : 0 so that elastic doesn't return matching docs in the result. To include or exclude it depends totally on your use case.

Sample result:

{
  "took": 25,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": [

    ]
  },
  "aggregations": {
    "subjects": {
      "doc_count": 6,
      "subjects": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "s1",
            "doc_count": 3,
            "avg_score": {
              "value": 80
            }
          },
          {
            "key": "s2",
            "doc_count": 2,
            "avg_score": {
              "value": 75
            }
          },
          {
            "key": "s3",
            "doc_count": 1,
            "avg_score": {
              "value": 80
            }
          }
        ]
      }
    }
  }
}

As you can see the result contains buckets with key as subject name and avg_score.value as the avg of marks.

UPDATE to include weighted_avg:

{
  "query": {
    "match": {
      "class": "10th"
    }
  },
  "aggs": {
    "subjects": {
      "nested": {
        "path": "subject"
      },
      "aggs": {
        "subjects": {
          "terms": {
            "field": "subject.name"
          },
          "aggs": {
            "avg_score": {
              "avg": {
                "field": "subject.marks"
              }
            },
            "weighted_grade": {
              "weighted_avg": {
                "value": {
                  "field": "subject.marks"
                },
                "weight": {
                  "field": "subject.weight"
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
Sign up to request clarification or add additional context in comments.

8 Comments

thanks @nishantsaini this worked exactly as needed. I have added a "weight" key while ingesting the documents - { "class": "10th", "subject": [ { "name": "s1", "marks": 90, "weight":30 }, { "name": "s2", "marks": 80, "weight":70 } ] } but on updating the query to weighted average the score is throwing an error. I have updated the nested "aggs" to - "aggs" : { "weighted_grade": { "weighted_avg": { "value": { "field": "subjects.score" }, "weight": { "field": "subjects.weight" } } } }
I have updated the question to clearly show what I mean in the above comment
your update doesn't cover the mention of the weight that I have added in the edit update. I want to aggregate the marks based on weights assigned subject-wise.
Your query has malformed json. Otherwise it seemed to be correct.
Can you copy paste your complete query as a seperate question. I or someone else can then look at it to find the issue.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.