
I have a use case where I need to calculate set overlaps over arbitrary time periods.

My data looks like this, when loaded into pandas. In MySQL the user_ids is stored with the data type JSON.

[screenshot: the data loaded into pandas]

I need to calculate the size of the union set when grouping by the date column. E.g., in the example below, if 2021-01-31 is grouped with 2021-02-28, then the result should be:

In [1]: len(set([46, 44, 14] + [44, 7, 36]))
Out[1]: 5

Doing this in Python is trivial, but I'm struggling with how to do this in MySQL.

Aggregating the arrays into an array of arrays is easy:

SELECT 
    date,
    JSON_ARRAYAGG(user_ids) as uids
FROM mytable
GROUP BY date

[screenshot: the result after aggregating the arrays]

but after that I face two problems:

  1. How to flatten the array of arrays into a single array
  2. How to extract distinct values (e.g. convert the array into a set)

Any suggestions? Thank you!

PS. In my case I can probably get by with doing the flattening and set conversion on the client side, but I was pretty surprised at how difficult something simple like this turned out to be... :/
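For reference, the client-side version I have in mind is roughly this (just a sketch; the uids value is made up from the example values above, as the JSON_ARRAYAGG query might return it when 2021-01-31 and 2021-02-28 end up in the same group):

import json

# Hypothetical value returned by JSON_ARRAYAGG for one group of dates
uids = "[[46, 44, 14], [44, 7, 36]]"

nested = json.loads(uids)                              # list of lists
unique_users = {uid for arr in nested for uid in arr}  # flatten + de-duplicate
print(len(unique_users))                               # -> 5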

  • Parse the JSON into separate values, then count the number of distinct values. Commented Aug 26, 2022 at 5:58
  • In MySQL the user_ids is stored with the data type JSON. This denormalized structure is what causes your problem. Normalize your data. Commented Aug 26, 2022 at 5:59
  • I omitted a lot of fields that I'm grouping over as well. Normalizing the data results in a very large table that's slow to query. This is exactly what I want to avoid. @Akina Commented Sep 12, 2022 at 3:12

2 Answers


As mentioned in the comments, storing JSON arrays in your database is sub-optimal and best avoided. That aside, it is actually easier to first unnest the JSON array into one row per value (which also makes your second point, extracting distinct values, straightforward):

SELECT mytable.date, jtable.VAL as user_id
FROM mytable, JSON_TABLE(user_ids, '$[*]' COLUMNS(VAL INT PATH '$')) jtable;
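As a side note, if all you ultimately need is the size of the union per group, you could also count the distinct extracted values directly at this point (a sketch against the same hypothetical schema):

SELECT mytable.date, COUNT(DISTINCT jtable.VAL) as num_unique_users
FROM mytable, JSON_TABLE(user_ids, '$[*]' COLUMNS(VAL INT PATH '$')) jtable
GROUP BY mytable.date;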

From here on out, we can group the dates again and recombine the user_ids into a JSON array with the JSON_ARRAYAGG function you already found:

SELECT mytable.date, JSON_ARRAYAGG(jtable.VAL) as user_ids
FROM mytable, JSON_TABLE(user_ids, '$[*]' COLUMNS(VAL INT PATH '$')) jtable
GROUP BY mytable.date;

You can try this out in this DB fiddle.

NOTE: this does require MySQL 8+ / MariaDB 10.6+.
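If you also want the re-aggregated arrays themselves de-duplicated (as far as I know, MySQL's JSON_ARRAYAGG does not accept DISTINCT), one option is to de-duplicate in a derived table first; a sketch along the same lines, with the same version requirement:

SELECT t.date, JSON_ARRAYAGG(t.user_id) as user_ids
FROM (
    SELECT DISTINCT mytable.date, jtable.VAL as user_id
    FROM mytable, JSON_TABLE(user_ids, '$[*]' COLUMNS(VAL INT PATH '$')) jtable
) t
GROUP BY t.date;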


4 Comments

Thank you for your answer, it was very helpful! I'm wondering why it is always considered bad to use JSON arrays? I do think it has use cases in analytics applications, so categorically rejecting that functionality does seem unnecessarily dogmatic to me, with all due respect :).
Of course I agree that using JSON for transactional applications is a bad idea.
The existence of this question on SO already kind of explains why this is a bad idea: it overcomplicates your queries and it is detrimental to performance. It looks like the JSON arrays don't have any use and you would be better off just storing a UID per row. That said, I personally think some JSON objects (primarily loose objects without a fixed structure) are OK to store in a relational database. But if you're storing fixed-format JSON, there's almost always a better way to do it in pure SQL.
I needed some way of pre-aggregating the data to avoid scanning the entire 200GiB table every time I want to calculate a set overlap. My question was about whether it's possible to manipulate that pre-aggregated data on the DB side. As you pointed out, that turns out to be rather complicated, but one row per UID is surely not the answer either: calculating the set overlap over the original 200GiB table already takes several minutes. Thank you for answering the original question though.

Thank you for the answers.

For anybody who's interested, the solution I ended up with was to store the data like this:

[screenshot: the pre-aggregated table]

And then do the set calculations in pandas.

(
    df.groupby(pd.Grouper(key="date", freq="QS"))  # quarterly bins (quarter start)
    .aggregate(
        num_unique_users=(
            "user_ids",
            # flatten the lists of user ids and count the distinct values
            lambda uids: len({uid for ul in uids for uid in ul}),
        ),
    )
)
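With a toy frame built from the example values in the question (a made-up two-row frame; in practice the rows come from the pre-aggregated table, with user_ids parsed into Python lists), the quarterly grouping gives the expected union size:

import pandas as pd

# Made-up rows using the example user id arrays from the question
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-01-31", "2021-02-28"]),
        "user_ids": [[46, 44, 14], [44, 7, 36]],
    }
)

result = df.groupby(pd.Grouper(key="date", freq="QS")).aggregate(
    num_unique_users=("user_ids", lambda uids: len({uid for ul in uids for uid in ul})),
)
print(result)  # one row for 2021-01-01 (Q1 2021) with num_unique_users == 5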

I was able to reduce a 20GiB table to around 300MiB, which is fast enough to query and retrieve data from.

