More efficient use of Python for loops with subsetting of a dataframe

Question

I have the following code, which iterates over a large number of unique IDs and computes summary statistics based on the current and prior visits. It works on a small amount of data, but it is quite slow on the larger set. Is there a faster way of approaching this (without using multiprocessing)?

import pandas as pd

d = {
    'id': ['A','B', 'B', 'C'],
    'visit_id': ['asd', 'awd', 'qdw', 'qwb'],
    'value': [-343.68, 343.68, -55.2, 55.2]}

df = pd.DataFrame(data=d)

agg_users = pd.DataFrame()

for user_id in df['id'].unique():
    # .copy() so that insert() does not modify a slice of df
    user_tbl = df.loc[df['id'] == user_id].copy()
    user_tbl.insert(0, 'visit_sequence', range(len(user_tbl)))

    agg_sessions = pd.DataFrame()
    # Distinct loop variable, so the outer one is not shadowed
    for seq in user_tbl['visit_sequence']:
        tmp = user_tbl.loc[user_tbl['visit_sequence'] <= seq]
        ses = tmp.loc[tmp['visit_sequence'] == seq, 'visit_id'].item()

        aggs = {
            'value': ['min', 'max', 'mean']
        }

        tmp2 = tmp.groupby('id').agg(aggs)

        new_columns = [k + '_' + agg for k in aggs.keys() for agg in aggs[k]]
        tmp2.columns = new_columns

        tmp2.reset_index(inplace=True)
        tmp2.insert(1, 'visit_id', ses)

        agg_sessions = pd.concat([agg_sessions, tmp2])

    agg_users = pd.concat([agg_users, agg_sessions])

agg_users

Peter Leimbigler · Accepted Answer · 2018-09-18 23:08:10Z

1

Based on the output of your code, I think you are looking for expanding-window aggregation (see the pandas documentation on expanding windows).
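As a minimal illustration of what "expanding window" means here, the sketch below applies it to the two values of id B from the question. Each output row summarizes every row up to and including itself. The variable names are illustrative and not part of the original answer.

```python
import pandas as pd

# The two visits of id 'B' from the question, in visit order
values = pd.Series([343.68, -55.2])

# Each row aggregates every value seen so far (current + prior visits)
running_min = values.expanding().min()
running_max = values.expanding().max()
running_mean = values.expanding().mean()

summary = pd.DataFrame({
    'min': running_min,
    'max': running_max,
    'mean': running_mean,
})
print(summary)
```

The second row reports the minimum, maximum, and mean over both visits, which matches the B/qdw row produced by the question's loop.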

The following solution is a bit clunky because of a pandas bug in df.groupby('colname').expanding().agg() documented in this GitHub issue.

# First, sort by id, then visit_id before grouping by id.
# Pandas groupby preserves the order of rows within each group:
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html

df.sort_values(['id', 'visit_id'], inplace=True)

# Calculate expanding-window aggregations for each id
aggmin = df.groupby('id').expanding()['value'].min().to_frame(name='value_min')
aggmax = df.groupby('id').expanding()['value'].max().to_frame(name='value_max')
aggmean = df.groupby('id').expanding()['value'].mean().to_frame(name='value_mean')

# Combine the above aggregations, and drop the extra index level
agged = pd.concat([aggmin, aggmax, aggmean], axis=1).reset_index().drop('level_1', axis=1)

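# Bring in the visit ids, which are guaranteed to be in the correct sort order.
# Assign by position (.values): after sort_values, df keeps its original index
# labels, so label-based alignment could mismatch rows if the sort reordered them.
agged['visit_id'] = df['visit_id'].values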

# Rearrange columns
agged = agged[['id', 'visit_id', 'value_min', 'value_max', 'value_mean']]

agged
  id visit_id  value_min  value_max  value_mean
0  A      asd    -343.68    -343.68     -343.68
1  B      awd     343.68     343.68      343.68
2  B      qdw     -55.20     343.68      144.24
3  C      qwb      55.20      55.20       55.20


# Output of your code:
agg_users
  id visit_id  value_min  value_max  value_mean
0  A      asd    -343.68    -343.68     -343.68
0  B      awd     343.68     343.68      343.68
0  B      qdw     -55.20     343.68      144.24
0  C      qwb      55.20      55.20       55.20


4 Comments

yokota Over a year ago

Thank you. I had no idea about expanding(). Learn something new every day.

yokota Over a year ago

Is it possible to pass a dictionary to agg(), or is that a limitation of this approach?

Peter Leimbigler Over a year ago

@yokota, I tried that, but unfortunately ran into the GitHub issue linked in my answer... It seems that support for that kind of "advanced" expanding().agg() will appear in pandas 0.24.

yokota Over a year ago

Ah. Got it. Thanks, @peter-leimbigler. Looking forward to 0.24.
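One possible workaround for the dictionary question, sketched here under the assumption that every statistic named in the dictionary is also the name of an expanding-window method (min, max, mean, and so on): loop over the dictionary and look each method up by name, instead of relying on expanding().agg(). This is an illustration, not the approach from the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'B', 'B', 'C'],
    'visit_id': ['asd', 'awd', 'qdw', 'qwb'],
    'value': [-343.68, 343.68, -55.2, 55.2],
})

# Same dictionary shape as in the question: column -> list of statistics
aggs = {'value': ['min', 'max', 'mean']}

agged = df[['id', 'visit_id']].copy()
for col, funcs in aggs.items():
    window = df.groupby('id')[col].expanding()
    for func in funcs:
        # Look up the expanding method by name and call it.
        # droplevel(0) removes the 'id' index level, so the result
        # aligns with df's original row labels on assignment.
        agged[f'{col}_{func}'] = getattr(window, func)().droplevel(0)

print(agged)
```

Because the results are aligned on the original row labels, this version uses the existing row order within each id and needs no separate step to reattach visit_id.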

Andy Hayden · Accepted Answer · 2018-09-18 22:43:53Z

0

You want to use a groupby and agg:

In [11]: res = df.groupby(["id", "visit_id"]).agg({"value": ["min", "max", "mean"]})

In [12]: res
Out[12]:
              value
                min     max    mean
id visit_id
A  asd      -343.68 -343.68 -343.68
B  awd       343.68  343.68  343.68
   qdw       -55.20  -55.20  -55.20
C  qwb        55.20   55.20   55.20

To remove the MultiIndex you can set the columns explicitly:

In [13]: res.columns = ["value_min", "value_max", "value_mean"]

In [14]: res
Out[14]:
             value_min  value_max  value_mean
id visit_id
A  asd         -343.68    -343.68     -343.68
B  awd          343.68     343.68      343.68
   qdw          -55.20     -55.20      -55.20
C  qwb           55.20      55.20       55.20

In [15]: res.reset_index()
Out[15]:
  id visit_id  value_min  value_max  value_mean
0  A      asd    -343.68    -343.68     -343.68
1  B      awd     343.68     343.68      343.68
2  B      qdw     -55.20     -55.20      -55.20
3  C      qwb      55.20      55.20       55.20

gets you the same result.
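As an alternative to assigning the column names by hand, the sketch below uses named aggregation to produce flat column names in one step. It assumes pandas 0.25 or later, where named aggregation is available; the output column names are chosen to match the ones above.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'B', 'B', 'C'],
    'visit_id': ['asd', 'awd', 'qdw', 'qwb'],
    'value': [-343.68, 343.68, -55.2, 55.2],
})

# Each keyword names an output column; the tuple is (input column, function)
res = (
    df.groupby(['id', 'visit_id'])
      .agg(
          value_min=('value', 'min'),
          value_max=('value', 'max'),
          value_mean=('value', 'mean'),
      )
      .reset_index()
)
print(res)
```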


2 Comments

Andy Hayden Over a year ago

Note: You used to be able to do the rename in a single step, but that's deprecated now (FutureWarning).

Peter Leimbigler Over a year ago

This output doesn't quite match the output of @yokota's code. Theirs aggregates min, max, and mean in expanding windows for each id, rather than a straightforward groupby(['id', 'visit_id']).
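To make the difference concrete, here is a small sketch that puts the two interpretations side by side for the maximum only. The column names per_visit_max and running_max are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'B', 'B', 'C'],
    'visit_id': ['asd', 'awd', 'qdw', 'qwb'],
    'value': [-343.68, 343.68, -55.2, 55.2],
})

comparison = df[['id', 'visit_id']].assign(
    # Grouping by (id, visit_id): each visit is its own one-row group
    per_visit_max=df.groupby(['id', 'visit_id'])['value'].transform('max'),
    # Grouping by id only: running maximum over current and prior visits
    running_max=df.groupby('id')['value'].cummax(),
)
print(comparison)
```

The two columns differ only for B/qdw, the one visit that has a prior visit with a larger value.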
