pandas - adding new aggregated feature

Question

I have this dataframe in pandas:

   day customer  amount
0    1    cust1     500
1    2    cust2     100
2    1    cust1      50
3    2    cust1     100
4    2    cust2     250
5    6    cust1      20

For convenience:

df = pd.DataFrame({'day': [1, 2, 1, 2, 2, 6],
                   'customer': ['cust1', 'cust2', 'cust1', 'cust1', 'cust2', 'cust1'],
                   'amount': [500, 100, 50, 100, 250, 20]})

I would like to create a new column 'amount2days' so as to aggragate amounts per customer for the last two days, to get the following dataframe:

   day customer  amount    amount2days   ----------------------------
0    1    cust1     500    500           (no past transactions)
1    2    cust2     100    100           (no past transactions)
2    1    cust1      50    550           (500 + 50 = rows 0,2 
3    2    cust1     100    650           (500 + 50 + 100, rows 0,2,3)
4    2    cust2     250    350           (100 + 250, rows 1,4) 
5    6    cust1      20    20            (notice day is 6, and no day=5 for cust1)

i.e. I would like to perform the following (pseudo) code:

df['amount2days'] = df_of_past_2_days['amount'].sum()

for each row. What is the most convenient way to do so?

The summing I wish to peroform is over the day, but days does not necessarily have to increment in each new row, as shown in the example. Still I want to sum amounts over the past 2 days.

Does this answer your question? pandas rolling sum of last five minutes — code11
– code11, Commented Oct 6, 2020 at 13:41

jezrael · Accepted Answer · 2020-10-08 07:19:57Z

2

Use groupby with Series.rolling and sum

Notice: Here is necessary add DataFrame.reset_index for avoid wrong alignement of data:

df['amount2days'] = (df.groupby('customer')['amount']
                       .rolling(2, min_periods=0)
                       .sum()
                       .reset_index(level=0, drop=True))
print (df)
   day customer  amount  amount2days
1    1    cust1     500        500.0
2    2    cust1     100        600.0
3    3    cust1     250        350.0

Why not use .to_numpy here? Because if not default index then output should be assigned wrongly - check sample below:

df = pd.DataFrame({'day': {0: 1, 2: 2, 5: 3, 1: 1, 6: 2, 4: 3}, 'customer': {0: 'cust2', 2: 'cust2', 5: 'cust2', 1: 'cust1', 6: 'cust1', 4: 'cust1'}, 'amount': {0: 5000, 2: 1000, 5: 2500, 1: 500, 6: 100, 4: 250}})
print (df)
   day customer  amount
0    1    cust2    5000
2    2    cust2    1000
5    3    cust2    2500
1    1    cust1     500
6    2    cust1     100
4    3    cust1     250

df['amount2days'] = (df.groupby('customer', sort=False).amount
                       .rolling(2, min_periods=0)
                       .sum()
                       .to_numpy())

df['amount2days1'] = (df.groupby('customer')['amount']
                       .rolling(2, min_periods=0)
                       .sum()
                       .reset_index(level=0, drop=True))
print (df)
   day customer  amount  amount2days  amount2days1
0    1    cust2    5000        500.0        5000.0
2    2    cust2    1000        600.0        6000.0
5    3    cust2    2500        350.0        3500.0
1    1    cust1     500       5000.0         500.0
6    2    cust1     100       6000.0         600.0
4    3    cust1     250       3500.0         350.0

EDIT: General solution:

def f(x):
    N = 1
    for i in pd.unique(x['day']):
        y = x[x['day'].between(i - N, i)]
        x.loc[y.index[-1], 'amountNdays'] = y['amount'].sum()
    
    return x

df = df.groupby('customer').apply(f)
df['amountNdays'] = df['amountNdays'].fillna(df['amount'])
print (df)
   day customer  amount  amountNdays
0    1    cust1     500        500.0
1    2    cust2     100        100.0
2    1    cust1      50        550.0
3    2    cust1     100        650.0
4    2    cust2     250        350.0
5    6    cust1      20         20.0

edited Oct 8, 2020 at 7:19

answered Oct 6, 2020 at 13:41

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

chendoy Over a year ago

Thanks. I've updated my question to be more clear. My data does not necessarily increments 'day' in each new row, still I want to sum over 2 days backwards. Does simple rolling will work in that case?

chendoy Over a year ago

still not working in the case that there is nothing to sum for backwards rows, I edited my example once again (should be just 20, but this method gives 120 there). Thank you.

chendoy Over a year ago

@jezarel with "there" I mean the 'amount2days' of the last row.

chendoy Over a year ago

and maybe there is a way to generalize this to amount_x_days easily?

jezrael Over a year ago

@user112112 - added general solution

yatu · Accepted Answer · 2020-10-06 14:23:24Z

1

You can use panda's rolling for moving window operations (depending on the version of pandas, reset_index as in jezrael's answer will be safer):

df['amount2days'] = (df.groupby('customer', sort=False).amount
                       .rolling(2, min_periods=0)
                       .sum()
                       .to_numpy()) 

print(df)
   day customer  amount  amount2days
1    1    cust1     500        500.0
2    2    cust1     100        600.0
3    3    cust1     250        350.0

edited Oct 6, 2020 at 14:23

answered Oct 6, 2020 at 13:40

yatu

88.6k12 gold badges93 silver badges148 bronze badges

1 Comment

Samuel Liew Over a year ago

Comments are not for extended discussion; this conversation has been moved to chat.

Collectives™ on Stack Overflow

pandas - adding new aggregated feature

2 Answers 2

5 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related