
The following code loops over a large number of unique IDs and, for each visit, computes summary statistics based on the current and all prior visits. It works for a small amount of data, but it becomes very slow on the larger set. Is there a faster way of approaching this (without using multiprocessing)?

import pandas as pd

d = {
    'id': ['A','B', 'B', 'C'],
    'visit_id': ['asd', 'awd', 'qdw', 'qwb'],
    'value': [-343.68, 343.68, -55.2, 55.2]}

df = pd.DataFrame(data=d)

agg_users = pd.DataFrame()

for i in df['id'].unique():
    user_tbl = df.loc[df['id']==i]
    user_tbl.insert(0, 'visit_sequence', range(0, 0 + len(user_tbl)))

    agg_sessions = pd.DataFrame()
    # Use a separate loop variable so the outer `i` (the id) is not shadowed
    for seq in user_tbl['visit_sequence']:
        tmp = user_tbl.loc[user_tbl['visit_sequence'] <= seq]
        ses = tmp.loc[tmp['visit_sequence'] == seq, 'visit_id'].item()

        aggs = {
            'value': ['min', 'max', 'mean']
        }

        tmp2 = tmp.groupby('id').agg(aggs)

        new_columns = [k + '_' + agg for k in aggs.keys() for agg in aggs[k]]
        tmp2.columns = new_columns

        tmp2.reset_index(inplace=True)
        tmp2.insert(1, 'visit_id', ses)

        agg_sessions = pd.concat([agg_sessions, tmp2])

    agg_users = pd.concat([agg_users, agg_sessions])

agg_users

2 Answers

1

Based on the output of your code, I think you are looking for expanding-window aggregation (see the pandas docs for expanding()).

The following solution is a bit clunky because of a pandas bug in df.groupby('colname').expanding().agg() documented in this GitHub issue.

# First, sort by id, then visit_id before grouping by id.
# Pandas groupby preserves the order of rows within each group:
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html

df.sort_values(['id', 'visit_id'], inplace=True)

# Calculate expanding-window aggregations for each id
aggmin = df.groupby('id').expanding()['value'].min().to_frame(name='value_min')
aggmax = df.groupby('id').expanding()['value'].max().to_frame(name='value_max')
aggmean = df.groupby('id').expanding()['value'].mean().to_frame(name='value_mean')

# Combine the above aggregations, and drop the extra index level
agged = pd.concat([aggmin, aggmax, aggmean], axis=1).reset_index().drop('level_1', axis=1)

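# Bring in the visit ids. After sorting, df's rows are in the same order as agged's,
# but agged has a fresh 0..n-1 index, so assign by position (.values) rather than by index label
agged['visit_id'] = df['visit_id'].values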

# Rearrange columns
agged = agged[['id', 'visit_id', 'value_min', 'value_max', 'value_mean']]

agged
  id visit_id  value_min  value_max  value_mean
0  A      asd    -343.68    -343.68     -343.68
1  B      awd     343.68     343.68      343.68
2  B      qdw     -55.20     343.68      144.24
3  C      qwb      55.20      55.20       55.20


# Output of your code:
agg_users
  id visit_id  value_min  value_max  value_mean
0  A      asd    -343.68    -343.68     -343.68
0  B      awd     343.68     343.68      343.68
0  B      qdw     -55.20     343.68      144.24
0  C      qwb      55.20      55.20       55.20
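
For comparison, here is a minimal sketch of the same statistics computed with cumulative groupby methods instead of expanding(). This is a swapped-in technique, not the approach above: cummin, cummax, cumsum and cumcount are standard pandas GroupBy methods, and the running mean is derived as running sum divided by the number of visits seen so far. It assumes the rows already appear in visit order within each id, as the loop in the question does.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'B', 'B', 'C'],
    'visit_id': ['asd', 'awd', 'qdw', 'qwb'],
    'value': [-343.68, 343.68, -55.2, 55.2],
})

# Assumption: rows are already in visit order within each id.
g = df.groupby('id')['value']

# The cumulative methods return Series aligned with df's own index,
# so no sorting, concatenating, or index juggling is needed.
agged = df[['id', 'visit_id']].copy()
agged['value_min'] = g.cummin()
agged['value_max'] = g.cummax()
# Running mean = running sum / number of visits seen so far.
agged['value_mean'] = g.cumsum() / (g.cumcount() + 1)
```

Because every result keeps the original index, this version does not depend on sort order, but it only covers statistics that have a cumulative form.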

4 Comments

Thank you. I had no idea about expanding(). Learn something new every day.
Is it possible to pass a dictionary to agg(), or is that a limitation of this approach?
@yokota, I tried that, but unfortunately ran into the GitHub issue linked in my answer... It seems that support for that kind of "advanced" expanding().agg() will appear in pandas 0.24.
Ah. Got it. Thanks, @peter-leimbigler. Looking forward to 0.24.
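
Until a dictionary can be passed to agg() directly, one possible workaround is to loop over the dictionary and call each expanding method by name. The sketch below assumes the `aggs` dict has the same shape as the one in the question, and that every function name in it exists as a method on the expanding object (min, max, mean, sum and so on). The names `pieces` and `stats` are hypothetical helpers introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'B', 'B', 'C'],
    'visit_id': ['asd', 'awd', 'qdw', 'qwb'],
    'value': [-343.68, 343.68, -55.2, 55.2],
})

# Same shape as the `aggs` dict in the question.
aggs = {'value': ['min', 'max', 'mean']}

grouped = df.groupby('id')
pieces = {}
for col, funcs in aggs.items():
    for func in funcs:
        # Look the method up by name, e.g. grouped['value'].expanding().min()
        pieces[col + '_' + func] = getattr(grouped[col].expanding(), func)()

# Each piece is indexed by (id, original row label). Dropping the id level
# leaves the original labels, so the statistics join back onto df by index.
stats = pd.concat(pieces, axis=1).droplevel(0)
agged = df[['id', 'visit_id']].join(stats)
```

The join on the original index avoids relying on row order when attaching visit_id.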
0

You want to use a groupby and agg:

In [11]: res = df.groupby(["id", "visit_id"]).agg({"value": ["min", "max", "mean"]})

In [12]: res
Out[12]:
              value
                min     max    mean
id visit_id
A  asd      -343.68 -343.68 -343.68
B  awd       343.68  343.68  343.68
   qdw       -55.20  -55.20  -55.20
C  qwb        55.20   55.20   55.20

To remove the MultiIndex you can set the columns explicitly:

In [13]: res.columns = ["value_min", "value_max", "value_mean"]

In [14]: res
Out[14]:
             value_min  value_max  value_mean
id visit_id
A  asd         -343.68    -343.68     -343.68
B  awd          343.68     343.68      343.68
   qdw          -55.20     -55.20      -55.20
C  qwb           55.20      55.20       55.20

In [15]: res.reset_index()
Out[15]:
  id visit_id  value_min  value_max  value_mean
0  A      asd    -343.68    -343.68     -343.68
1  B      awd     343.68     343.68      343.68
2  B      qdw     -55.20     -55.20      -55.20
3  C      qwb      55.20      55.20       55.20

gets you the same result.

2 Comments

Note: You used to be able to do the rename in a single step, but that's deprecated now (FutureWarning).
This output doesn't quite match the output of @yokota's code. Theirs aggregates min, max, and mean in expanding windows for each id, rather than a straightforward groupby(['id', 'visit_id']).
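
To make the difference concrete, here is a small sketch comparing the two approaches on the question's sample data, using max only. Grouping by both id and visit_id puts each row in its own group, while the expanding window covers the current visit plus all earlier visits for the same id.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'B', 'B', 'C'],
    'visit_id': ['asd', 'awd', 'qdw', 'qwb'],
    'value': [-343.68, 343.68, -55.2, 55.2],
})

# One group per (id, visit_id): each group holds a single row here.
per_visit = df.groupby(['id', 'visit_id'])['value'].max()

# Expanding window within each id; the result is indexed by
# (id, original row label), so the 'qdw' visit is ('B', 2).
expanding = df.groupby('id')['value'].expanding().max()

per_visit_qdw = per_visit.loc[('B', 'qdw')]   # -55.2
expanding_qdw = expanding.loc[('B', 2)]       # 343.68
```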
