
For a Pandas DataFrame I am looking for a vectorized way to calculate the cumulative sum of the number of views per group, excluding the views from more than a week ago. I have tried all kinds of apply functions, but I can't seem to look back and forward 7 days to collect the data I need.

I have a function that works on a small amount of data, but because it is a loop it takes way too long on all the data. There are 2,500+ groups and every group has about 100 dates filled in, for a total of 250,000+ records.

I looked at using shift, for instance, but because not all dates are filled in for all the groups, this does not work. I also tried to use the map function, but this also took too long.

The Pandas DataFrame I have is this one:

    GROUP DAY           VIEWS   VIEWS_CUM
165 1     2011-09-18    82      82
166 1     2011-09-19    15      97
167 1     2011-12-21    29      126
168 1     2011-12-22    15      141
169 1     2011-12-23    2       143
170 2     2012-01-07    51      51
171 2     2012-01-08    10      61
172 2     2012-01-09    11      72
173 2     2012-01-17    33      105
174 2     2012-01-18    29      134
175 2     2012-01-19    6       140

And I want to get something like this:

    GROUP DAY           VIEWS   VIEWS_CUM   VIEWS_CUM_BEFORE
165 1     2011-09-18    82      82          0
166 1     2011-09-19    15      97          0
167 1     2011-12-21    29      126         29
168 1     2011-12-22    15      141         44
169 1     2011-12-23    2       143         46
170 2     2012-01-07    51      51          0
171 2     2012-01-08    10      61          0
172 2     2012-01-09    11      72          0
173 2     2012-01-17    33      105         33
174 2     2012-01-18    29      134         62
175 2     2012-01-19    6       140         68

The function that seems to work, but is too slow:

import pandas as pd
from pandas.tseries.offsets import Day

# Dict with data
data = {'DAY': {0: '09-18-11', 1: '09-19-11', 2: '12-21-11', 3: '12-22-11', 4: '12-23-11', 5: '01-07-12', 6: '01-08-12', 7: '01-09-12', 8: '01-17-12', 9: '01-18-12', 10: '01-19-12'}, 'GROUP': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2}, 'VIEWS': {0: 82, 1: 15, 2: 29, 3: 15, 4: 2, 5: 51, 6: 10, 7: 11, 8: 33, 9: 29, 10: 6}, 'VIEWS_CUM': {0: 82, 1: 97, 2: 126, 3: 141, 4: 143, 5: 51, 6: 61, 7: 72, 8: 105, 9: 134, 10: 140}}

# Convert dict to pandas dataframe
df = pd.DataFrame.from_dict(data)

# Make sure the DAY column is datetime
df['DAY'] = pd.to_datetime(df['DAY'], format='%m-%d-%y')

# Sort by GROUP and DAY
df = df.sort_values(['GROUP', 'DAY'])

# Default setting for VIEWS_CUM_BEFORE
df['VIEWS_CUM_BEFORE'] = 0

# Loop to add VIEWS_CUM_BEFORE
for index, row in df.iterrows():
    views_cum_before_max = df.loc[(row['GROUP'] == df['GROUP']) & 
                                  (row['DAY'] >= df['DAY'] + Day(7))]['VIEWS_CUM'].max()

    df.loc[index, 'VIEWS_CUM_BEFORE'] = row['VIEWS_CUM'] - views_cum_before_max

# If VIEWS_CUM_BEFORE is empty, make it 0
df['VIEWS_CUM_BEFORE'] = df['VIEWS_CUM_BEFORE'].fillna(0)

# Show result
df
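For comparison, the loop's subtraction logic can also be expressed vectorized with pd.merge_asof: shift each row's DAY forward by 7 days and, for every row, look up the last row in the same group whose shifted DAY it has already passed. Since VIEWS_CUM is monotonically increasing within a group, that lookup is exactly the max the loop computes. This is a sketch, not part of the original question:

```python
import pandas as pd

data = {'GROUP': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        'DAY': ['09-18-11', '09-19-11', '12-21-11', '12-22-11', '12-23-11',
                '01-07-12', '01-08-12', '01-09-12', '01-17-12', '01-18-12',
                '01-19-12'],
        'VIEWS': [82, 15, 29, 15, 2, 51, 10, 11, 33, 29, 6],
        'VIEWS_CUM': [82, 97, 126, 141, 143, 51, 61, 72, 105, 134, 140]}
df = pd.DataFrame(data)
df['DAY'] = pd.to_datetime(df['DAY'], format='%m-%d-%y')

# Shift each row's DAY forward by 7 days: a row becomes "visible" to any
# later row in the same group that is at least a week newer.
cutoff = df[['GROUP', 'DAY', 'VIEWS_CUM']].copy()
cutoff['DAY'] += pd.Timedelta('7 days')

# merge_asof picks, per group, the last visible row for every current row,
# i.e. the maximal VIEWS_CUM from 7+ days ago (VIEWS_CUM is monotone).
merged = pd.merge_asof(df.sort_values('DAY'), cutoff.sort_values('DAY'),
                       on='DAY', by='GROUP', suffixes=('', '_OLD'))

# Rows with no match 7+ days back get NaN, which becomes 0 as in the loop.
merged['VIEWS_CUM_BEFORE'] = (merged['VIEWS_CUM']
                              - merged['VIEWS_CUM_OLD']).fillna(0).astype(int)
result = merged.sort_values(['GROUP', 'DAY']).reset_index(drop=True)
```

This avoids the per-row loop entirely, so it should scale to the 250,000+ records mentioned above.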

1 Answer


I group the data into 7-day bins; the cumulative sum ends up in the column VIEWS_CUM_BEFORE.

Solution that keeps only one column:

df = df.drop(['VIEWS_CUM'], axis=1)
df['VIEWS_CUM_BEFORE'] = df.groupby([pd.Grouper(freq='7D',key='DAY'),'GROUP']).cumsum()

Or, with the cumsum column selected explicitly:

df['VIEWS_CUM_BEFORE'] = df.groupby([pd.Grouper(freq='7D',key='DAY'),'GROUP'])['VIEWS'].cumsum()

Or a NumPy cumsum solution (requires import numpy as np):

df['VIEWS_CUM_BEFORE'] = df.groupby([pd.Grouper(freq='7D',key='DAY'),'GROUP'])['VIEWS'].apply(np.cumsum)
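For intuition, pd.Grouper(freq='7D') creates fixed 7-day calendar bins anchored at the earliest DAY in the frame, rather than a per-row rolling window. A small illustration on a hypothetical subset of the dates:

```python
import pandas as pd

df = pd.DataFrame({'DAY': pd.to_datetime(['2011-09-18', '2011-09-19',
                                          '2011-12-21', '2012-01-07',
                                          '2012-01-08', '2012-01-09']),
                   'VIEWS': [82, 15, 29, 51, 10, 11]})

# Each key is the start of a 7-day bin counted from the earliest DAY;
# the Grouper also yields empty bins, so those are skipped.
bins = {key.date().isoformat(): list(grp['DAY'].dt.date.astype(str))
        for key, grp in df.groupby(pd.Grouper(freq='7D', key='DAY'))
        if not grp.empty}
```

This is why 2012-01-07 lands in a different bin than 2012-01-08 and 2012-01-09 in the output below: the bin boundary happens to fall between them.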

But cumsum also counts the first subgroup of each group, and those rows need to be set to 0:

    GROUP        DAY  VIEWS  VIEWS_CUM_BEFORE
0       1 2011-09-18     82                82
1       1 2011-09-19     15                97
2       1 2011-12-21     29                29
3       1 2011-12-22     15                44
4       1 2011-12-23      2                46
5       2 2012-01-07     51                51
6       2 2012-01-08     10                10
7       2 2012-01-09     11                21
8       2 2012-01-17     33                33
9       2 2012-01-18     29                62
10      2 2012-01-19      6                68

We have to find the minimal DAY of each group, add 7 days, and then set VIEWS_CUM_BEFORE to 0 wherever DAY is earlier than that.

def repeat_value(grp):
    grp['DAY2'] = grp['DAY'].min() + pd.Timedelta('7 days')
    return grp
df = df.groupby(['GROUP']).apply(repeat_value)
print(df)
    GROUP        DAY  VIEWS  VIEWS_CUM_BEFORE       DAY2
0       1 2011-09-18     82                82 2011-09-25
1       1 2011-09-19     15                97 2011-09-25
2       1 2011-12-21     29                29 2011-09-25
3       1 2011-12-22     15                44 2011-09-25
4       1 2011-12-23      2                46 2011-09-25
5       2 2012-01-07     51                51 2012-01-14
6       2 2012-01-08     10                10 2012-01-14
7       2 2012-01-09     11                21 2012-01-14
8       2 2012-01-17     33                33 2012-01-14
9       2 2012-01-18     29                62 2012-01-14
10      2 2012-01-19      6                68 2012-01-14


df.loc[df['DAY2'] > df['DAY'], 'VIEWS_CUM_BEFORE'] = 0
del df['DAY2']
print(df)
    GROUP        DAY  VIEWS  VIEWS_CUM_BEFORE
0       1 2011-09-18     82                 0
1       1 2011-09-19     15                 0
2       1 2011-12-21     29                29
3       1 2011-12-22     15                44
4       1 2011-12-23      2                46
5       2 2012-01-07     51                 0
6       2 2012-01-08     10                 0
7       2 2012-01-09     11                 0
8       2 2012-01-17     33                33
9       2 2012-01-18     29                62
10      2 2012-01-19      6                68
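The groupby/apply step above can also be written without a custom function by using transform, which broadcasts the per-group minimum back to every row. A sketch of the same zeroing logic, on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'GROUP': [1, 1, 1, 2, 2, 2],
                   'DAY': pd.to_datetime(['2011-09-18', '2011-09-19',
                                          '2011-12-21', '2012-01-07',
                                          '2012-01-17', '2012-01-18']),
                   'VIEWS_CUM_BEFORE': [82, 97, 29, 51, 33, 62]})

# End of each group's first week: minimal DAY per group plus 7 days.
first_week_end = df.groupby('GROUP')['DAY'].transform('min') + pd.Timedelta('7 days')

# Zero out rows that fall inside the first week, like the DAY2 step above.
df.loc[df['DAY'] < first_week_end, 'VIEWS_CUM_BEFORE'] = 0
```

This skips building and deleting the temporary DAY2 column and avoids the apply call.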