I'm fairly new to Elasticsearch aggregations and want to build the equivalent of the following SQL:

select fullname, natcode, count(1) from table where birthdate = '18-sep-1993' group by fullname, natcode having count(1) > 2 order by count(1) desc

So, if I have the following data:
[image: sample data]

I need to get the results as:
[image: expected results]

As you can see, the results are grouped by fullname and natcode, keep only groups with count > 2, and are ordered by count in descending order.
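
For reference, the logic the SQL expresses can be sketched in plain Python. This is only an illustration of the intended grouping: the sample rows are hypothetical (the original screenshots are not available) and `group_having` is an illustrative helper name, not part of any library.

```python
from collections import Counter

def group_having(rows, birthdate, min_exclusive=2):
    """Mimic GROUP BY fullname, natcode HAVING count(1) > min_exclusive ORDER BY count(1) DESC."""
    counts = Counter(
        (row["fullname"], row["natcode"])
        for row in rows
        if row["birthdate"] == birthdate
    )
    kept = [(name, nat, n) for (name, nat), n in counts.items() if n > min_exclusive]
    # Sort by count descending; ties are broken by key so the order is predictable.
    return sorted(kept, key=lambda t: (-t[2], t[0], t[1]))

# Hypothetical sample rows
rows = (
    [{"fullname": "JOHN SMITH", "natcode": "111", "birthdate": "18-sep-1993"}] * 4
    + [{"fullname": "MIKE CAIN", "natcode": "205", "birthdate": "18-sep-1993"}] * 3
    + [{"fullname": "JULIA ROBERTS", "natcode": "111", "birthdate": "18-sep-1993"}] * 2
    + [{"fullname": "JOHN SMITH", "natcode": "111", "birthdate": "01-jan-1990"}] * 5
)
result = group_having(rows, "18-sep-1993")
```

With these rows, only the two groups that occur more than twice on the given date survive, largest first.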

I've managed to form the following query:

{
  "size": 0,
  "aggs": {
    "profs": {
      "filter": {
        "term": {
          "birthDate": "18-Sep-1993"
        }
      },
      "aggs": {
        "name_count": {
          "terms": {
            "field": "fullName.raw"
          },
          "aggs": {
            "nat_count": {
              "terms": {
                "field": "natCode"
              },
              "aggs": {
                "my_filter": {
                  "bucket_selector": {
                    "buckets_path": {
                      "the_doc_count": "_count"
                    },
                    "script": {
                      "source": "params.the_doc_count>2"
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

What it achieves: it filters on the date, creates a bucket per fullname (name_count) with a sub-bucket per natcode (nat_count), and filters the natcode buckets on doc count.

The problem: name_count buckets whose nat_count list is empty still appear. I only want buckets that meet the required count. Here is a sample of the results:

"aggregations": {
    "profs": {
      "doc_count": 3754,
      "name_count": {
        "doc_count_error_upper_bound": 4,
        "sum_other_doc_count": 3732,
        "buckets": [
          {
            "key": "JOHN SMITH",
            "doc_count": 3,
            "nat_count": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "111",
                  "doc_count": 3
                }
              ]
            }
          },
          {
            "key": "MIKE CAIN",
            "doc_count": 3,
            "nat_count": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "205",
                  "doc_count": 3
                }
              ]
            }
          },
          {
            "key": "JULIA ROBERTS",
            "doc_count": 2,
            "nat_count": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": []
            }
          },
          {
            "key": "JAMES STEPHEN COOK",
            "doc_count": 2,
            "nat_count": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": []
            }
          }

In the results, I don't want the last two names (JULIA ROBERTS and JAMES STEPHEN COOK) to show up.

What is also missing: ordering by the group count. I want the (fullname, natcode) group with the highest count to show up first.
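
As a client-side stopgap, the response can be flattened into (fullName, natCode, count) rows, which drops the empty name buckets and allows sorting by count. The sketch below assumes the response shape shown above; `flatten_groups` is a hypothetical helper name. It can only sort what the server returned, and a terms aggregation returns just the top 10 buckets unless `size` is raised.

```python
def flatten_groups(aggregations, outer="name_count", inner="nat_count"):
    """Return (fullName, natCode, doc_count) tuples, largest count first."""
    # The question wraps the terms aggregation in a "profs" filter aggregation; unwrap it if present.
    root = aggregations.get("profs", aggregations)
    groups = []
    for name_bucket in root[outer]["buckets"]:
        # Name buckets with an empty inner list contribute nothing, so they vanish here.
        for nat_bucket in name_bucket[inner]["buckets"]:
            groups.append((name_bucket["key"], nat_bucket["key"], nat_bucket["doc_count"]))
    # sorted() is stable, so groups with equal counts keep the server's order.
    return sorted(groups, key=lambda g: g[2], reverse=True)

# Trimmed copy of the response from the question
sample = {
    "profs": {
        "doc_count": 3754,
        "name_count": {
            "buckets": [
                {"key": "JOHN SMITH", "doc_count": 3,
                 "nat_count": {"buckets": [{"key": "111", "doc_count": 3}]}},
                {"key": "MIKE CAIN", "doc_count": 3,
                 "nat_count": {"buckets": [{"key": "205", "doc_count": 3}]}},
                {"key": "JULIA ROBERTS", "doc_count": 2,
                 "nat_count": {"buckets": []}},
            ]
        },
    }
}
groups = flatten_groups(sample)
```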

Needed later: the grouping has to include a couple more fields, about four in total.

Please excuse any wrong terminology. Hopefully the kind of help I need is clear. Thanks.

1 Answer

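Below is how your query should look. The date filter moves into the query section, min_doc_count on the inner terms aggregation drops natCode groups below the threshold, and the bucket_selector removes names that are left with no natCode bucket.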

Required Query (Final Answer)

POST <your_index_name>/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "birthDate": "18-Sep-1993"
        }
      }
    }
  }, 
  "aggs": {
    "groupby_fullname": {
      "terms": {
        "field": "fullName.raw",
        "size": 2000
      },
      "aggs": {
        "natcode_filter": {
          "bucket_selector": {
            "buckets_path": {
              "hits": "groupby_natcode._bucket_count"
            },
            "script": "params.hits > 0"
          }
        },
        "groupby_natcode": {
          "terms": {
            "field": "natCode",
            "size": 2000,
            "min_doc_count": 3
          }
        }
      }
    }
  }
}
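
If the date or threshold changes often, the request body can be generated instead of edited by hand. The following is a minimal sketch, assuming the field names used in this question; `build_group_query` and its parameters are hypothetical names. Note that `min_doc_count` is inclusive, so a strict `count > n` corresponds to `n + 1`.

```python
import json

def build_group_query(birth_date, having_gt=2, size=2000):
    """Build the nested-terms request body for HAVING count > having_gt."""
    return {
        "size": 0,
        "query": {"bool": {"filter": {"term": {"birthDate": birth_date}}}},
        "aggs": {
            "groupby_fullname": {
                "terms": {"field": "fullName.raw", "size": size},
                "aggs": {
                    # Drop names that end up with no surviving natCode bucket.
                    "natcode_filter": {
                        "bucket_selector": {
                            "buckets_path": {"hits": "groupby_natcode._bucket_count"},
                            "script": "params.hits > 0",
                        }
                    },
                    "groupby_natcode": {
                        "terms": {
                            "field": "natCode",
                            "size": size,
                            # min_doc_count is inclusive, so "count > n" becomes n + 1.
                            "min_doc_count": having_gt + 1,
                        }
                    },
                },
            }
        },
    }

body = build_group_query("18-Sep-1993")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a `_search` request.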

Alternative Solution: (Similar to select distinct)

As a last resort, you can do something like a select distinct on fullName + "_" + natCode, so the keys take the form JOHN SMITH_111. This gives accurate results, except that the keys come back in this combined form. Because a terms aggregation orders its buckets by doc_count descending by default, the groups also come back ordered by count.

POST <your_index_name>/_search
{
   "size":0,
   "query":{
      "bool":{
         "filter":{
            "term":{
               "birthDate":"18-Sep-1993"
            }
         }
      }
   },
   "aggs":{
      "name_count":{
         "terms":{
            "script":{
               "source":"doc['fullName.raw'].value + params.param + doc['natCode'].value",
               "lang":"painless",
               "params":{  
                  "param":"_"
               }
            }
         },
         "aggs":{  
            "my_filter":{  
               "bucket_selector":{  
                  "buckets_path":{  
                     "doc_count":"_count"
                  },
                  "script":"params.doc_count > 2"
               }
            }
         }
      }
   }
}
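
The same composite-key idea extends to the four-field grouping mentioned in the question. Here is a rough sketch of a helper that builds the script object for any list of fields; `build_key_script` is a hypothetical name, and `gender` and `city` are made-up field names used only for illustration. It emits the `source` key, which the question's own query already uses.

```python
def build_key_script(fields, separator="_"):
    """Build a Painless script object that joins doc values with a separator."""
    if not fields:
        raise ValueError("at least one field is required")
    parts = [f"doc['{name}'].value" for name in fields]
    return {
        "source": " + params.sep + ".join(parts),
        "lang": "painless",
        "params": {"sep": separator},
    }

# "gender" and "city" are hypothetical field names
script = build_key_script(["fullName.raw", "natCode", "gender", "city"])
```

The returned dictionary would go under `"terms": {"script": ...}` in the request. Pick a separator that cannot occur in the field values, otherwise the combined keys cannot be split back reliably.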

Hope it helps.

Comments

The query you suggested is not giving me the expected results. I've updated the question with more sample data and the required results.
@Maarab I got the case of the field name wrong (typed fullname instead of fullName). Could you please test the query now and let me know if it works? Running both of the above queries on the data in your question gives the result you are looking for.
fullname in the data sample is the DB column name. I ran the queries and am not getting any buckets in the results (sorry, I don't know how to format this). Result 1: "aggregations": { "profs": { "doc_count": 53, "name_count": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 42, "buckets": [] } } } Result 2: "aggregations": { "name_count": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 42, "buckets": [] } }
@Maarab Apologies. I mistakenly used "birthDate": "18-Sep-1933" instead of "birthDate": "18-Sep-1993", which I've now corrected in the answer. Please try the same queries once again and let me know if it works out. If it still doesn't, could you please share your mapping details?
Yes, I just ran it and it seems to be giving me what I need, but I have to test a bit more before concluding. Your suggestion to filter first is also right. The second query shows how the query should be structured, i.e. filter first and then apply the aggregations. I will try to post a final update after testing.