
The problem domain consists of kiosks on which many tokens are displayed. A token is issued by exactly one issuer, but it can be present on multiple kiosks. The kiosk logic accepts or refuses users based on which tokens are present on that kiosk.

Our Elastic mapping is this:

"mappings": {
  "Kiosk": {
     "dynamic": "strict",
     "properties": {
        "kioskId": {
           "type": "keyword"
        },
        "token": {
           "type": "nested",
           "include_in_parent": true,
           "properties": {
              "tokenId": {
                 "type": "keyword"
              },
              "issuer": {
                 "type": "keyword"
              }
           }
        }
     }
  }
}

Here are two typical documents:

Kiosk1

   {
      "kioskId": "123",
      "token": {
         "tokenId": "fp1",
         "issuer": "i1"
      }
   }

Kiosk2

   {
      "kioskId": "321",
      "token": [
         {
            "tokenId": "fp1",
            "issuer": "i1"
         },
         {
            "tokenId": "fp2",
            "issuer": "i2"
         }
      ]
   }

Now, the requirement is to count all the unique tokens in the system, bucketed by issuer. We have had no luck so far. We tried this query:

POST _search
{
   "aggs": {
      "state": {
         "nested": {
            "path": "token"
         },
         "aggs": {
            "TOKENS_BY_ISSUER": {
               "terms": {
                  "field": "token.issuer"
               }
            }
         }
      }
   }
}

This, as expected, gives the following result:

"aggregations": {
      "state": {
         "doc_count": 3,
         "TOKENS_BY_ISSUER": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
               {
                  "key": "i1",
                  "doc_count": 2
               },
               {
                  "key": "i2",
                  "doc_count": 1
               }
            ]
         }
      }
   }

Is there a way to learn that there are only two tokens in the system, one issued by i1 and one by i2? Something like this:

"buckets": [
            {
               "key": "i1",
               "doc_count": 1
            },
            {
               "key": "i2",
               "doc_count": 1
            }
         ]

If not, where has the mapping gone wrong? It does not seem like an unusual mapping to me. Do note that the mapping posted here is truncated for brevity; we have further nested levels under token, carrying fields specific to a token and its parent kiosk.

1 Answer

You can change your query to something like this:

{
   "query": {
      "match_all": {}
   },
   "aggs": {
      "state": {
         "nested": {
            "path": "token"
         },
         "aggs": {
            "TOKENS_BY_ISSUER": {
               "terms": {
                  "field": "token.issuer"
               },
               "aggs": {
                  "distinct_tokens": {
                     "cardinality": { "field": "token.tokenId" }
                  }
               }
            }
         }
      }
   }
}

Note:

  1. The cardinality aggregation in Elasticsearch has an error rate associated with it, as it uses the HyperLogLog approximation technique to count unique field values in a bucket. The error rate therefore grows as the number of tokens in your system increases.
  2. When indexing the Kiosk1 document, the token field should be a vector/array, so as to make sure you are not doing anything wrong while indexing.

In order to increase the accuracy of the cardinality aggregation, try increasing the precision_threshold setting in the query. This comes at the cost of higher memory utilisation.
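For example, precision_threshold can be set directly inside the cardinality aggregation (a sketch; 10000 is an illustrative value, and counts below the threshold are close to exact at the cost of extra memory per bucket):

   {
      "aggs": {
         "state": {
            "nested": { "path": "token" },
            "aggs": {
               "TOKENS_BY_ISSUER": {
                  "terms": { "field": "token.issuer" },
                  "aggs": {
                     "distinct_tokens": {
                        "cardinality": {
                           "field": "token.tokenId",
                           "precision_threshold": 10000
                        }
                     }
                  }
               }
            }
         }
      }
   }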

Check out the Elasticsearch Cardinality Aggregation documentation for further details.

I would rather recommend designing this based on the requirement, and only if you are ready to accept the error percentages at scale.
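If exact counts are required and the total number of distinct tokens is modest, one alternative (a sketch; the size values here are illustrative assumptions) is to nest a terms aggregation on token.tokenId under the issuer buckets and count the returned sub-buckets client-side, since each distinct tokenId produces exactly one bucket:

   {
      "size": 0,
      "aggs": {
         "state": {
            "nested": { "path": "token" },
            "aggs": {
               "TOKENS_BY_ISSUER": {
                  "terms": { "field": "token.issuer", "size": 100 },
                  "aggs": {
                     "unique_tokens": {
                        "terms": { "field": "token.tokenId", "size": 10000 }
                     }
                  }
               }
            }
         }
      }
   }

The number of buckets under unique_tokens for each issuer is the exact unique-token count, but note that every tokenId becomes a bucket, so this does not scale to very large token sets.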


2 Comments

Thanks. This helped, but as you pointed out it is approximate, which is not usable in our scenario where we need exact results.
It's accurate and also super fast in many use cases, but I would not recommend it once the number of unique token.tokenId values in your system crosses a few hundred thousand.
