Pandas apply to multiple rows with missing dates

Question

For a Pandas DataFrame I am looking for a vectorized way to calculate the cumulative sum of the number of views per given group, except the views from more than a week ago. I have tried all kinds apply functions, but I can't seem to go up and down 7 days to collect the data I need.

I have a function that works on a small amount of data, but because it is a loop it takes way too long on all the data. There are 2500+ groups and every group has about 100 dates filled in. A total of 250.000+ records.

I looked at using shift for instance, but because not all dates are filled in for all the groups, this does not work. I also tried to use the map function, this also look too long.

The Pandas DataFrame I have is this one:

    GROUP DAY           VIEWS   VIEWS_CUM
165 1     2011-09-18    82      82
166 1     2011-09-19    15      97
167 1     2011-12-21    29      126
168 1     2011-12-22    15      141
169 1     2011-12-23    2       143
170 2     2012-01-07    51      51
171 2     2012-01-08    10      61
172 2     2012-01-09    11      72
173 2     2012-01-17    33      105
174 2     2012-01-18    29      134
175 2     2012-01-19    6       140

And I want to get something like this:

    GROUP DAY           VIEWS   VIEWS_CUM   VIEWS_CUM_BEFORE
165 1     2011-09-18    82      82          0
166 1     2011-09-19    15      97          0
167 1     2011-12-21    29      126         29
168 1     2011-12-22    15      141         44
169 1     2011-12-23    2       143         46
170 2     2012-01-07    51      51          0
171 2     2012-01-08    10      61          0
172 2     2012-01-09    11      72          0
173 2     2012-01-17    33      105         33
174 2     2012-01-18    29      134         62
175 2     2012-01-19    6       140         68

The function that seems to work, but is too slow:

import pandas as pd
from pandas.tseries.offsets import *

# Dict with data
data = {'DAY': {0: '09-18-11', 1: '09-19-11', 2: '12-21-11', 3: '12-22-11', 4: '12-23-11', 5: '01-07-12', 6: '01-08-12', 7: '01-09-12', 8: '01-17-12', 9: '01-18-12', 10: '01-19-12'}, 'GROUP': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2}, 'VIEWS': {0: 82, 1: 15, 2: 29, 3: 15, 4: 2, 5: 51, 6: 10, 7: 11, 8: 33, 9: 29, 10: 6}, 'VIEWS_CUM': {0: 82, 1: 97, 2: 126, 3: 141, 4: 143, 5: 51, 6: 61, 7: 72, 8: 105, 9: 134, 10: 140}}

# Convert dict to pandas dataframe
df = pd.DataFrame.from_dict(data)

# Make sure the DAY column is datetime
df['DAY'] = pd.to_datetime(df['DAY'])

# Group by GROUP and DAY
df= df.sort(['GROUP', 'DAY'])

# Default setting for VIEWS_CUM_BEFORE
df['VIEWS_CUM_BEFORE'] = 0

# Loop to add VIEWS_CUM_BEFORE
for index, row in df.iterrows():
    views_cum_before_max = df.loc[(row['GROUP'] == df['GROUP']) & 
                                  (row['DAY'] >= df['DAY'] + Day(7))]['VIEWS_CUM'].max()

    df.ix[index, 'VIEWS_CUM_BEFORE'] = row['VIEWS_CUM'] - views_cum_before_max

# If VIEWS_CUM_BEFORE is empty, make it 0
df['VIEWS_CUM_BEFORE'] = df['VIEWS_CUM_BEFORE'].fillna(0)

# Show result
df

jezrael · Accepted Answer · 2015-09-23 10:56:16Z

I group data by 7 days and cumulative sum is in column VIEWS_CUM_BEFORE.

Only one column solution or

df = df.drop(['VIEWS_CUM'], axis=1)
df['VIEWS_CUM_BEFORE'] = df.groupby([pd.Grouper(freq='7D',key='DAY'),'GROUP']).cumsum()

Define column for cumsum solution or

df['VIEWS_CUM_BEFORE'] = df.groupby([pd.Grouper(freq='7D',key='DAY'),'GROUP'])['VIEWS'].cumsum()

Numpy cumcum solution

df['VIEWS_CUM_BEFORE'] = df.groupby([pd.Grouper(freq='7D',key='DAY'),'GROUP'])['VIEWS'].apply(np.cumsum)

But cumsum count first subgroup and need 0 values them.

    GROUP        DAY  VIEWS  VIEWS_CUM_BEFORE
0       1 2011-09-18     82                82
1       1 2011-09-19     15                97
2       1 2011-12-21     29                29
3       1 2011-12-22     15                44
4       1 2011-12-23      2                46
5       2 2012-01-07     51                51
6       2 2012-01-08     10                10
7       2 2012-01-09     11                21
8       2 2012-01-17     33                33
9       2 2012-01-18     29                62
10      2 2012-01-19      6                68

We have to find minimal DAY of group, add 7 days and then if this day is lower, make it 0.

def repeat_value(grp):
    grp['DAY2'] = grp['DAY'].min() + pd.Timedelta('7 days')
    return grp
df = df.groupby(['GROUP']).apply(repeat_value)
print df

    GROUP        DAY  VIEWS  VIEWS_CUM_BEFORE       DAY2
0       1 2011-09-18     82                82 2011-09-25
1       1 2011-09-19     15                97 2011-09-25
2       1 2011-12-21     29                29 2011-09-25
3       1 2011-12-22     15                44 2011-09-25
4       1 2011-12-23      2                46 2011-09-25
5       2 2012-01-07     51                51 2012-01-14
6       2 2012-01-08     10                10 2012-01-14
7       2 2012-01-09     11                21 2012-01-14
8       2 2012-01-17     33                33 2012-01-14
9       2 2012-01-18     29                62 2012-01-14
10      2 2012-01-19      6                68 2012-01-14


df.loc[df['DAY2'] > df['DAY'], 'VIEWS_CUM_BEFORE'] = 0
del df['DAY2']
print df

    GROUP        DAY  VIEWS  VIEWS_CUM_BEFORE
0       1 2011-09-18     82                 0
1       1 2011-09-19     15                 0
2       1 2011-12-21     29                29
3       1 2011-12-22     15                44
4       1 2011-12-23      2                46
5       2 2012-01-07     51                 0
6       2 2012-01-08     10                 0
7       2 2012-01-09     11                 0
8       2 2012-01-17     33                33
9       2 2012-01-18     29                62
10      2 2012-01-19      6                68

Collectives™ on Stack Overflow

Pandas apply to multiple rows with missing dates

1 Answer 1

Only one column solution or

Define column for cumsum solution or

Numpy cumcum solution

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Only one column solution or

Define column for cumsum solution or

Numpy cumcum solution

Comments

Your Answer

Sign up or log in

Post as a guest

Related