Elasticsearch - Aggregating on multiple fields, filtering on count and ordering on count

Question

I'm a bit new to aggregations and I want to create an equivalent to the following SQL:

select fullname, natcode, count(1) from table where birthdate = '18-sep-1993' group by fullname, natcode having count(1) > 2 order by count(1) desc

So, if I have the following data:

I need to get the results as:

As you can see, the results are grouped by fullname and natcode, have count > 2, and are ordered by count.
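To make the target concrete, here is a minimal Python sketch of what the SQL computes, run in memory over a handful of made-up rows. The rows are illustrative only (they are not the asker's data), and the sketch assumes the birthdate filter has already been applied.

```python
from collections import Counter

# Hypothetical (fullname, natcode) rows that already match the birthdate filter.
rows = [
    ("JOHN SMITH", "111"),
    ("JOHN SMITH", "111"),
    ("JOHN SMITH", "111"),
    ("JOHN SMITH", "111"),
    ("MIKE CAIN", "205"),
    ("MIKE CAIN", "205"),
    ("MIKE CAIN", "205"),
    ("JULIA ROBERTS", "111"),
    ("JULIA ROBERTS", "205"),
]


def group_having_order(rows, min_exclusive=2):
    """GROUP BY the whole tuple, HAVING count > min_exclusive, ORDER BY count DESC."""
    counts = Counter(rows)
    kept = [(key, n) for key, n in counts.items() if n > min_exclusive]
    # Sort by count descending; ties are broken by key so the order is deterministic.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


result = group_having_order(rows)
for (fullname, natcode), n in result:
    print(fullname, natcode, n)
```

Because the grouping key is the whole tuple, the same function would work unchanged for rows with four fields instead of two.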

I've managed to form the following query:

{
  "size": 0,
  "aggs": {
    "profs": {
      "filter": {
        "term": {
          "birthDate": "18-Sep-1993"
        }
      },
      "aggs": {
        "name_count": {
          "terms": {
            "field": "fullName.raw"
          },
          "aggs": {
            "nat_count": {
              "terms": {
                "field": "natCode"
              },
              "aggs": {
                "my_filter": {
                  "bucket_selector": {
                    "buckets_path": {
                      "the_doc_count": "_count"
                    },
                    "script": {
                      "source": "params.the_doc_count>2"
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

What is achieved: it filters on date, creates a bucket per fullname (name_count) with a sub-bucket per natcode (nat_count), and filters the natcode buckets on doc count.

The problem with this: I also see empty name_count buckets. I only want buckets that have the required count. The following is a sample of the results:

"aggregations": {
    "profs": {
      "doc_count": 3754,
      "name_count": {
        "doc_count_error_upper_bound": 4,
        "sum_other_doc_count": 3732,
        "buckets": [
          {
            "key": "JOHN SMITH",
            "doc_count": 3,
            "nat_count": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "111",
                  "doc_count": 3
                }
              ]
            }
          },
          {
            "key": "MIKE CAIN",
            "doc_count": 3,
            "nat_count": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "205",
                  "doc_count": 3
                }
              ]
            }
          },
          {
            "key": "JULIA ROBERTS",
            "doc_count": 2,
            "nat_count": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": []
            }
          },
          {
            "key": "JAMES STEPHEN COOK",
            "doc_count": 2,
            "nat_count": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": []
            }
          }

In the results, I don't want the last two names (JULIA ROBERTS and JAMES STEPHEN COOK) to show up.

Additionally, what is missing: the ordering on the group count at the end. I want the (fullname, natcode) groups with the highest counts to show up first.

Required further ahead: the grouping needs to be done on a couple more fields, so there would be about 4 fields.

Please excuse me if I have used any wrong terms. Hopefully you get the idea of what help is required. Thanks.

Kamal Kunjapur · Accepted Answer · 2019-03-19 21:37:06Z


Below is how your query should look.

Required Query (Final Answer)

POST <your_index_name>/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "birthDate": "18-sep-1993"
        }
      }
    }
  },
  "aggs": {
    "groupby_fullname": {
      "terms": {
        "field": "fullName.raw",
        "size": 2000
      },
      "aggs": {
        "natcode_filter": {
          "bucket_selector": {
            "buckets_path": {
              "hits": "groupby_natcode._bucket_count"
            },
            "script": "params.hits > 0"
          }
        },
        "groupby_natcode": {
          "terms": {
            "field": "natCode",
            "size": 2000,
            "min_doc_count": 3
          }
        }
      }
    }
  }
}
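Note that the nested response is not globally ordered by the (fullName, natCode) count: by default the outer buckets come back ordered by the fullName doc count, and the inner buckets are ordered only within each name. One option is to flatten and sort on the client. The following is a minimal Python sketch, assuming the aggregation names used in the query above (groupby_fullname, groupby_natcode); the sample response is hand-written for illustration and the function name is hypothetical.

```python
def flatten_buckets(aggregations, min_count=3):
    """Turn nested fullName -> natCode buckets into flat rows sorted by count."""
    rows = []
    for name_bucket in aggregations["groupby_fullname"]["buckets"]:
        for nat_bucket in name_bucket["groupby_natcode"]["buckets"]:
            # Re-check the threshold here so the result does not depend on
            # the min_doc_count used in the request.
            if nat_bucket["doc_count"] >= min_count:
                rows.append(
                    (name_bucket["key"], nat_bucket["key"], nat_bucket["doc_count"])
                )
    # Highest pair count first; Python's sort is stable, so ties keep response order.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Hand-written sample in the shape of the "aggregations" part of a search response.
sample_aggregations = {
    "groupby_fullname": {
        "buckets": [
            {
                "key": "JOHN SMITH",
                "doc_count": 6,
                "groupby_natcode": {
                    "buckets": [
                        {"key": "111", "doc_count": 3},
                        {"key": "205", "doc_count": 3},
                    ]
                },
            },
            {
                "key": "MIKE CAIN",
                "doc_count": 5,
                "groupby_natcode": {"buckets": [{"key": "205", "doc_count": 5}]},
            },
        ]
    }
}

flat_rows = flatten_buckets(sample_aggregations)
for fullname, natcode, count in flat_rows:
    print(fullname, natcode, count)
```

In the sample, JOHN SMITH is the larger name bucket, but MIKE CAIN / 205 is the larger pair, so it moves to the top after flattening.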

Alternative Solution: (Similar to select distinct)

As a last resort, what I can come up with is something like a select distinct on fullName + "_" + natCode, so your keys would be of the form JOHN SMITH_111. This does give you accurate results, except that the keys come back in this combined form.

POST <your_index_name>/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "birthDate": "18-sep-1993"
        }
      }
    }
  },
  "aggs": {
    "name_count": {
      "terms": {
        "script": {
          "source": "doc['fullName.raw'].value + params.param + doc['natCode'].value",
          "lang": "painless",
          "params": {
            "param": "_"
          }
        }
      },
      "aggs": {
        "my_filter": {
          "bucket_selector": {
            "buckets_path": {
              "doc_count": "_count"
            },
            "script": "params.doc_count > 2"
          }
        }
      }
    }
  }
}
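Since the question mentions grouping on about four fields eventually, the combined-key approach can be generated for any list of fields. The following is a rough Python sketch that only builds the request body as a dict; it has not been run against a cluster. The function name, the sep parameter name, and the extra field names in the usage line (gender, city) are hypothetical, the terms size is an added assumption, and the script is passed under the source key as in the question's own query.

```python
import json


def build_distinct_key_query(fields, birth_date, min_exclusive=2, separator="_", size=2000):
    """Build a search body that groups on several fields via one concatenated key."""
    # Painless expression: doc['a'].value + params.sep + doc['b'].value + ...
    expression = " + params.sep + ".join(
        "doc['{}'].value".format(field) for field in fields
    )
    return {
        "size": 0,
        "query": {"bool": {"filter": {"term": {"birthDate": birth_date}}}},
        "aggs": {
            "name_count": {
                "terms": {
                    "script": {
                        "source": expression,
                        "lang": "painless",
                        "params": {"sep": separator},
                    },
                    # Assumption: raise the bucket limit above the default.
                    "size": size,
                },
                "aggs": {
                    "my_filter": {
                        "bucket_selector": {
                            "buckets_path": {"doc_count": "_count"},
                            # Keep only keys seen more than min_exclusive times.
                            "script": "params.doc_count > {}".format(min_exclusive),
                        }
                    }
                },
            }
        },
    }


# Hypothetical four-field grouping; "gender" and "city" are made-up field names.
body = build_distinct_key_query(
    ["fullName.raw", "natCode", "gender", "city"], "18-sep-1993"
)
print(json.dumps(body, indent=2))
```

Because the terms aggregation orders buckets by doc count descending by default, the surviving combined keys should already come back with the highest counts first.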

Hope it helps.

edited Mar 19, 2019 at 21:37

answered Mar 13, 2019 at 18:42

Kamal Kunjapur


19 Comments

Maarab Over a year ago

The query you have advised is not giving me what is expected. I've updated more details of sample data and the results that are required.

Kamal Kunjapur Over a year ago

@Maarab I got the case of the field name wrong (typed fullname instead of fullName). Could you please test the query now and let me know if it works? Running either of the above queries on the data you've mentioned in the question gives the result you are looking for.

Maarab Over a year ago

fullname in the data sample is the DB column name. I ran the queries and am not getting any buckets in the results (sorry, I don't know how to format). Result 1:

"aggregations": {
    "profs": {
      "doc_count": 53,
      "name_count": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 42,
        "buckets": []
      }
    }
  }

Result 2:

"aggregations": {
    "name_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 42,
      "buckets": []
    }
  }

Kamal Kunjapur Over a year ago

@Maarab Apologies. I mistakenly used "birthDate": "18-Sep-1933" instead of "birthDate": "18-Sep-1993", which I've corrected in the answers. Please try the same queries once again and let me know if it works out. If it still doesn't, could you please share your mapping details?

Maarab Over a year ago

Yes, I just ran it and it seems to be giving me what I require, but I need to test a bit more to conclude. Also, your suggestion to filter first is right. The second query is how the query should be structured, i.e. filter first and then apply aggregations. I will try to post the final update after testing.
