
I have a use case where I need to calculate set overlaps over arbitrary time periods.

My data looks like this, when loaded into pandas. In MySQL the user_ids is stored with the data type JSON.

[screenshot: the data loaded into pandas]

I need to calculate the size of the union set when grouping by the date column. E.g., in the example below, if 2021-01-31 is grouped with 2021-02-28, then the result should be:

In [1]: len(set([46, 44, 14] + [44, 7, 36]))
Out[1]: 5

Doing this in Python is trivial, but I'm struggling with how to do this in MySQL.

Aggregating the arrays into an array of arrays is easy:

SELECT 
    date,
    JSON_ARRAYAGG(user_ids) as uids
FROM mytable
GROUP BY date

[screenshot: the result after aggregating the arrays]

but after that I face two problems:

  1. How to flatten the array of arrays into a single array
  2. How to extract distinct values (e.g. convert the array into a set)

Any suggestions? Thank you!

PS. In my case I can probably get by with doing the flattening and set conversion on the client side, but I was pretty surprised at how difficult something simple like this turned out to be... :/
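For reference, the client-side version I have in mind is roughly this (just a sketch; the uids value is made up from the example values above, as the JSON_ARRAYAGG query might return it when 2021-01-31 and 2021-02-28 end up in the same group):

import json

# Hypothetical value returned by JSON_ARRAYAGG for one group of dates
uids = "[[46, 44, 14], [44, 7, 36]]"

nested = json.loads(uids)                              # list of lists
unique_users = {uid for arr in nested for uid in arr}  # flatten + de-duplicate
print(len(unique_users))                               # -> 5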

  • Parse the JSON into separate values, then count the number of distinct values. Commented Aug 26, 2022 at 5:58
  • In MySQL the user_ids is stored with the data type JSON. This denormalized structure is what causes your problem. Normalize your data. Commented Aug 26, 2022 at 5:59
  • I omitted a lot of fields that I'm grouping over as well. Normalizing the data results in a very large table that's slow to query. This is exactly what I want to avoid. @Akina Commented Sep 12, 2022 at 3:12

2 Answers


As mentioned in the comments, storing JSON arrays in your database is sub-optimal and best avoided. That aside, it is actually easier to first unnest the JSON array into one row per value (which also makes your second point, extracting distinct values, straightforward):

SELECT mytable.date, jtable.VAL as user_id
FROM mytable, JSON_TABLE(user_ids, '$[*]' COLUMNS(VAL INT PATH '$')) jtable;
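As a side note, if all you ultimately need is the size of the union per group, you could also count the distinct extracted values directly at this point (a sketch against the same hypothetical schema):

SELECT mytable.date, COUNT(DISTINCT jtable.VAL) as num_unique_users
FROM mytable, JSON_TABLE(user_ids, '$[*]' COLUMNS(VAL INT PATH '$')) jtable
GROUP BY mytable.date;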

From here on out, we can group the dates again and recombine the user_ids into a JSON array with the JSON_ARRAYAGG function you already found:

SELECT mytable.date, JSON_ARRAYAGG(jtable.VAL) as user_ids
FROM mytable, JSON_TABLE(user_ids, '$[*]' COLUMNS(VAL INT PATH '$')) jtable
GROUP BY mytable.date;

You can try this out in this DB fiddle.

NOTE: this does require MySQL 8+ / MariaDB 10.6+.
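If you also want the re-aggregated arrays themselves de-duplicated (as far as I know, MySQL's JSON_ARRAYAGG does not accept DISTINCT), one option is to de-duplicate in a derived table first; a sketch along the same lines, with the same version requirement:

SELECT t.date, JSON_ARRAYAGG(t.user_id) as user_ids
FROM (
    SELECT DISTINCT mytable.date, jtable.VAL as user_id
    FROM mytable, JSON_TABLE(user_ids, '$[*]' COLUMNS(VAL INT PATH '$')) jtable
) t
GROUP BY t.date;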


4 Comments

Thank you for your answer, it was very helpful! I'm wondering why it is always considered bad to use JSON arrays? I do think it has use cases in analytics applications, so categorically rejecting that functionality does seem unnecessarily dogmatic to me, with all due respect :).
Of course I agree that using JSON for transactional applications is a bad idea.
The existence of this question on SO already kind of explains why this is a bad idea: it overcomplicates your queries and it is detrimental to performance. It looks like the JSON arrays don't have any use and you would be better off just storing a UID per row. That said, I personally think some JSON objects (primarily loose objects without a fixed structure) are OK to store in a relational database. But if you're storing fixed-format JSON, there's almost always a better way to do it in pure SQL.
I needed some way of pre-aggregating the data to avoid scanning the entire 200GiB table every time I want to calculate a set overlap. My question was about whether it's possible to manipulate that pre-aggregated data on the DB side. As you pointed out, that turns out to be rather complicated, but one row per UID is surely not the answer either: calculating the set overlap over the original 200GiB table already takes several minutes. Thank you for answering the original question though.

Thank you for the answers.

For anybody who's interested, the solution I ended up with was to store the data like this:

[screenshot: the pre-aggregated table]

And then do the set calculations in pandas.

(
    df.groupby(pd.Grouper(key="date", freq="QS"))  # quarterly bins (quarter start)
    .aggregate(
        num_unique_users=(
            "user_ids",
            # flatten the lists of user ids and count the distinct values
            lambda uids: len({uid for ul in uids for uid in ul}),
        ),
    )
)
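With a toy frame built from the example values in the question (a made-up two-row frame; in practice the rows come from the pre-aggregated table, with user_ids parsed into Python lists), the quarterly grouping gives the expected union size:

import pandas as pd

# Made-up rows using the example user id arrays from the question
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-01-31", "2021-02-28"]),
        "user_ids": [[46, 44, 14], [44, 7, 36]],
    }
)

result = df.groupby(pd.Grouper(key="date", freq="QS")).aggregate(
    num_unique_users=("user_ids", lambda uids: len({uid for ul in uids for uid in ul})),
)
print(result)  # one row for 2021-01-01 (Q1 2021) with num_unique_users == 5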

I was able to reduce a 20GiB table to around 300MiB, which is fast enough to query and retrieve data from.

