
I have an index transactions with fields user_id, amount and category. I would like to calculate the average amount per user and category, and then finish off with just getting the total average amount per category. The SQL would look like this:

SELECT AVG(average), category FROM

    (SELECT user_id, category, AVG(amount) AS average FROM transactions WHERE amount < 100000 
    GROUP BY user_id, category) AS a1

GROUP BY category

I'm only getting so far as to having a response of a bucket with all user ids, and then inside that, a bucket with the average amounts per category (for the user). I don't understand how to add yet another aggregation to do what I'm after.

{
  "aggs": {
    "group_by_users": {
      "terms": {
        "field": "user_id.keyword"
      },
      "aggs": {
        "group_by_category": {
          "terms": {
              "field": "category.keyword"
          },
          "aggs": {
            "average_amount": {
              "avg": {
                "field": "amount"
              }
            }
          }
        }
      }
    }
  }
}

Any help is very much appreciated.

Edit: An example was requested, so here is some sample data first, then the intermediate result, and finally the wanted result at the bottom.

-----------------------------------------
|  user_id  |   category   |   amount   |
-----------------------------------------
|     1     |   insurances |   1000     |
|     1     |   transport  |     50     |
|     1     |   transport  |    100     |
|     2     |   insurances |    700     |
|     2     |   insurances |    200     |
|     2     |   transport  |    300     |
-----------------------------------------

Calculation for user 1 transport: (50+100)/2

So, the first thing that needs to happen is a group by user_id and category to get the average per user and category.

That would yield:

-----------------------------------------
|  user_id  |   category   |   average  |
-----------------------------------------
|     1     |   insurances |   1000     |
|     1     |   transport  |     75     |
|     2     |   insurances |    450     |
|     2     |   transport  |    300     |
-----------------------------------------

It's important to understand I can't do an average for all users together, I need the average spend per user, per category, first.

So now I just want to group by category and calculate the average amount:

-----------------------------
|   category   |   average  |
-----------------------------
|   insurances |   725      |
|   transport  |   187.5    |
-----------------------------

Example for insurances: (1000 + 450) / 2
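To make the two-step arithmetic concrete, here is a minimal plain-Python sketch (not Elasticsearch, just the math on the sample rows above) that first averages per (user_id, category) pair and then averages those averages per category:

```python
from collections import defaultdict
from statistics import mean

# Sample data from the table above.
rows = [
    {"user_id": 1, "category": "insurances", "amount": 1000},
    {"user_id": 1, "category": "transport", "amount": 50},
    {"user_id": 1, "category": "transport", "amount": 100},
    {"user_id": 2, "category": "insurances", "amount": 700},
    {"user_id": 2, "category": "insurances", "amount": 200},
    {"user_id": 2, "category": "transport", "amount": 300},
]

# Step 1: average amount per (user_id, category) pair.
per_user_cat = defaultdict(list)
for r in rows:
    per_user_cat[(r["user_id"], r["category"])].append(r["amount"])
user_cat_avg = {k: mean(v) for k, v in per_user_cat.items()}

# Step 2: average those per-user averages per category.
per_cat = defaultdict(list)
for (user, cat), avg in user_cat_avg.items():
    per_cat[cat].append(avg)
result = {cat: mean(avgs) for cat, avgs in per_cat.items()}

print(result)  # per-category averages: insurances -> 725, transport -> 187.5
```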

1 Answer


The avg_bucket pipeline aggregation will do the job in some cases (though it might not scale well with the size of the index; see the notes below):

POST myindex1/_search
{
  "size": 0,
  "aggs": {
    "by category": {
      "terms": {
        "field": "category.keyword"
      },
      "aggs": {
        "by user_id": {
          "terms": {
            "field": "user_id"
          },
          "aggs": {
            "avg by user": {
              "avg": {
                "field": "amount"
              }
            }
          }
        },
        "average by user, category": {
          "avg_bucket": {
            "buckets_path": "by user_id>avg by user"
          }
        }
      }
    }
  }
}

This will return a response that looks something like this:

{
  ...
  "aggregations" : {
    "by category" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "insurances",
          "doc_count" : 3,
          "by user_id" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 2,
                "doc_count" : 2,
                "avg by user" : {
                  "value" : 450.0
                }
              },
              {
                "key" : 1,
                "doc_count" : 1,
                "avg by user" : {
                  "value" : 1000.0
                }
              }
            ]
          },
          "average by user, category" : {
            "value" : 725.0   <--- average for `insurances`
          }
        },
        {
          "key" : "transport",
          "doc_count" : 3,
          "by user_id" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 1,
                "doc_count" : 2,
                "avg by user" : {
                  "value" : 75.0
                }
              },
              {
                "key" : 2,
                "doc_count" : 1,
                "avg by user" : {
                  "value" : 300.0
                }
              }
            ]
          },
          "average by user, category" : {
            "value" : 187.5      <--- average for `transport`
          }
        }
      ]
    }
  }
}
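Pulling the final numbers out of that response is then just a walk over the category buckets. A small Python sketch, assuming the aggregation names from the query above and a response already parsed into a dict (abbreviated here to the relevant fields):

```python
# Abbreviated response body, parsed from JSON; aggregation names match the
# query ("by category", "average by user, category").
response = {
    "aggregations": {
        "by category": {
            "buckets": [
                {"key": "insurances",
                 "average by user, category": {"value": 725.0}},
                {"key": "transport",
                 "average by user, category": {"value": 187.5}},
            ]
        }
    }
}

# Map each category bucket key to its pipeline-aggregation value.
averages = {
    b["key"]: b["average by user, category"]["value"]
    for b in response["aggregations"]["by category"]["buckets"]
}
print(averages)  # {'insurances': 725.0, 'transport': 187.5}
```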

How does it work?

Let's start with the "by user_id" terms aggregation: we ask Elasticsearch to group documents by user_id and to compute the average of amount using the avg aggregation.

POST myindex1/_search
{
  "size": 0,
  "aggs": {
    "by user_id": {
      "terms": {
        "field": "user_id"
      },
      "aggs": {
        "avg by user": {
          "avg": {
            "field": "amount"
          }
        }
      }
    }
  }
}

This is equivalent to SQL:

SELECT user_id, avg(amount)
FROM my_index
GROUP BY user_id;

So far this is pretty straightforward. But how can we average also on the category now?

We can add another terms aggregation, "by category", on top of the "by user_id" one. Now avg will also take the category into account:

POST myindex1/_search
{
  "size": 0,
  "aggs": {
    "by category": {
      "terms": {
        "field": "category.keyword"
      },
      "aggs": {
        "by user_id": {
          "terms": {
            "field": "user_id"
          },
          "aggs": {
            "avg by user": {
              "avg": {
                "field": "amount"
              }
            }
          }
        }
      }
    }
  }
}

This is equivalent to SQL:

SELECT user_id, category, avg(amount)
FROM my_index
GROUP BY user_id, category;

Can we now use the result from the previous query, and aggregate on category again?

This can be done with the avg_bucket pipeline aggregation. The only thing missing is to tell the avg_bucket aggregation where exactly to find the buckets to aggregate over, which is done via a buckets_path expression.

This is how we arrive at the query posted at the top, and it effectively does the equivalent of the SQL you posted in the question.

But...

What can go wrong?

The downside of this method is that it does not scale well with the number of documents in the index.

The fact is, a pipeline aggregation only operates with already aggregated data:

Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree.

In our case this means that if there are more than 10 distinct user_id values in the index, our averages will not be precise.

This happens because, by default, a terms aggregation returns only the top 10 buckets, so the SQL-ish equivalent is more like the following:

SELECT category, avg(avg_amount)
FROM (
    SELECT user_id, category, avg(amount) avg_amount
    FROM my_index
    GROUP BY user_id, category
    LIMIT 10 per user_id
) Q
LIMIT 10;

This limit can be changed via size parameter of terms aggregation.
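For example, raising size on both terms aggregations reduces the truncation (the values below are illustrative; they should cover the actual cardinality of each field):

```json
POST myindex1/_search
{
  "size": 0,
  "aggs": {
    "by category": {
      "terms": { "field": "category.keyword", "size": 100 },
      "aggs": {
        "by user_id": {
          "terms": { "field": "user_id", "size": 10000 },
          "aggs": {
            "avg by user": { "avg": { "field": "amount" } }
          }
        },
        "average by user, category": {
          "avg_bucket": { "buckets_path": "by user_id>avg by user" }
        }
      }
    }
  }
}
```

Note that very large size values have a memory and latency cost of their own, which is why this approach does not scale indefinitely.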

Another thing to keep in mind is that terms returns approximate document counts, which can affect the averages as well.


Hope that helps!


2 Comments

Wow. What a well written answer. Thank you very much for taking your time. I understand it much better now, and will experiment with bucket paths a bit. The query worked perfectly. But I immediately understood what you mean with the scaling, I'm getting "doc_count_error_upper_bound" : 375 (and so on). Maybe I've misunderstood what ES can be used for if these kind of things can't be done. (I'm very new to it).
@Andreas Glad to be of help! You may overcome this limitation with re-modeling your data, like precomputing some of the fields. The section "Modeling your data" of the Definitive Guide might be of use (it is based on ES2, but key concepts didn't change much).
