
The problem domain consists of kiosks on which many tokens are displayed. A token is issued by exactly one issuer, but it can be present on multiple kiosks. The kiosk logic accepts or refuses users based on which tokens are present on that kiosk.

Our Elastic mapping is this:

"mappings": {
  "Kiosk": {
     "dynamic": "strict",
     "properties": {
        "kioskId": {
           "type": "keyword"
        },
        "token": {
           "type": "nested",
           "include_in_parent": true,
           "properties": {
              "tokenId": {
                 "type": "keyword"
              },
              "issuer": {
                 "type": "keyword"
              }
           }
        }
     }
  }
}

Here are two typical documents:

Kiosk1

   {
      "kioskId": "123",
      "token": {
         "tokenId": "fp1",
         "issuer": "i1"
      }
   }

Kiosk2

   {
      "kioskId": "321",
      "token": [
         {
            "tokenId": "fp1",
            "issuer": "i1"
         },
         {
            "tokenId": "fp2",
            "issuer": "i2"
         }
      ]
   }

Now, the requirement is to count all the unique tokens in the system, bucketed by issuer. We have had no luck so far. We tried this query:

POST _search
{
   "aggs": {
      "state": {
         "nested": {
            "path": "token"
         },
         "aggs": {
            "TOKENS_BY_ISSUER": {
               "terms": {
                  "field": "token.issuer"
               }
            }
         }
      }
   }
}

This, as expected, gives the following result:

"aggregations": {
      "state": {
         "doc_count": 3,
         "TOKENS_BY_ISSUER": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
               {
                  "key": "i1",
                  "doc_count": 2
               },
               {
                  "key": "i2",
                  "doc_count": 1
               }
            ]
         }
      }
   }

Is there a way to learn that there are only two tokens in the system, one issued by i1 and one by i2? Something like this:

"buckets": [
            {
               "key": "i1",
               "doc_count": 1
            },
            {
               "key": "i2",
               "doc_count": 1
            }
         ]

If not, where has the mapping gone wrong? It does not seem like an unusual mapping to me. Do note that the mapping posted here is truncated for brevity; we have further nested levels under token, carrying fields specific to a token and its parent kiosk.

1 Answer

You can change your query to something like this:

{
   "query": {
      "match_all": {}
   },
   "aggs": {
      "state": {
         "nested": {
            "path": "token"
         },
         "aggs": {
            "TOKENS_BY_ISSUER": {
               "terms": {
                  "field": "token.issuer"
               },
               "aggs": {
                  "distinct_tokens": {
                     "cardinality": { "field": "token.tokenId" }
                  }
               }
            }
         }
      }
   }
}

Note:

  1. The cardinality aggregation in Elasticsearch has an error rate associated with it, as it uses the HyperLogLog approximation technique to count unique field values in a bucket. The error rate therefore grows as the number of tokens in your system increases.
  2. When indexing the Kiosk1 document, the token field should be a vector/array, so as to make sure you are not doing anything wrong while indexing.

In order to increase the accuracy of the cardinality aggregation, try increasing the precision_threshold setting in the query. This comes at the cost of higher memory utilisation.
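For example, precision_threshold can be set directly inside the cardinality aggregation (a sketch; 10000 is an illustrative value, and counts below the threshold are close to exact at the cost of extra memory per bucket):

   {
      "aggs": {
         "state": {
            "nested": { "path": "token" },
            "aggs": {
               "TOKENS_BY_ISSUER": {
                  "terms": { "field": "token.issuer" },
                  "aggs": {
                     "distinct_tokens": {
                        "cardinality": {
                           "field": "token.tokenId",
                           "precision_threshold": 10000
                        }
                     }
                  }
               }
            }
         }
      }
   }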

Check out the Elasticsearch Cardinality Aggregation documentation for further details.

I would rather recommend designing this based on the requirement, and only if you are ready to accept the error percentages at scale.
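If exact counts are required and the total number of distinct tokens is modest, one alternative (a sketch; the size values here are illustrative assumptions) is to nest a terms aggregation on token.tokenId under the issuer buckets and count the returned sub-buckets client-side, since each distinct tokenId produces exactly one bucket:

   {
      "size": 0,
      "aggs": {
         "state": {
            "nested": { "path": "token" },
            "aggs": {
               "TOKENS_BY_ISSUER": {
                  "terms": { "field": "token.issuer", "size": 100 },
                  "aggs": {
                     "unique_tokens": {
                        "terms": { "field": "token.tokenId", "size": 10000 }
                     }
                  }
               }
            }
         }
      }
   }

The number of buckets under unique_tokens for each issuer is the exact unique-token count, but note that every tokenId becomes a bucket, so this does not scale to very large token sets.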


2 Comments

Thanks. This helped, but as you pointed out it is approximate, which is not usable in our scenario where we need exact results.
It's accurate and also super fast in many use cases, but I would not recommend it once the number of unique token.tokenId values in your system crosses a few hundred thousand.
