
I have a dataframe with a MultiIndex where the last level of the index is a date. I am trying to perform a rolling operation on the columns with a specific frequency. As I understand it, the usual pandas approach when you have a DatetimeIndex is to call rolling with a frequency string (for example '2D' if I wanted the window to be two days). Another suggested approach is to resample the DatetimeIndex and then apply rolling with the integer 2. Essentially, what I want is to group by all the index levels except the last one and then tell the rolling operation to use the last level for timedelta-based rolling. Below is an example to demonstrate this:

from datetime import datetime
import pandas as pd
multi_index = pd.MultiIndex.from_tuples([
    ("A", datetime(2017, 1, 1)), 
    ("A", datetime(2017, 1, 2)), 
    ("A", datetime(2017, 1, 3)), 
    ("A", datetime(2017, 1, 4)),
    ("B", datetime(2017, 1, 1)),
    ("B", datetime(2017, 1, 3)),
    ("B", datetime(2017, 1, 4))])
df = pd.DataFrame(index=multi_index, data={"colA": [1, 1, 1, 1, 1, 1, 1]})
print(df)
df.groupby([df.index.get_level_values(0), pd.Grouper(freq="1D", level=-1)]).sum().rolling(2).sum()

The above code does not create a row for (B, datetime(2017, 1, 2)), so the rolling sums will all be two.
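For reference, the rolling sum over the grouped frame comes out roughly like this (grouped is just an illustrative intermediate name); note that there is no (B, 2017-01-02) row, so every window of 2 sees two existing rows:

grouped = df.groupby([df.index.get_level_values(0),
                      pd.Grouper(freq="1D", level=-1)]).sum()
print(grouped.rolling(2).sum())
              colA
A 2017-01-01   NaN
  2017-01-02   2.0
  2017-01-03   2.0
  2017-01-04   2.0
B 2017-01-01   2.0
  2017-01-03   2.0
  2017-01-04   2.0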

One ugly way to get around this, which really only works if there is a group that has all the days, is to unstack, fillna and stack before rolling:

(df.groupby([df.index.get_level_values(0), pd.Grouper(freq="1D", level=-1)])
   .sum().unstack().fillna(0).stack().rolling(2).sum())

Needless to say, this is an ugly hack, slow and error-prone. Is there a nice way to achieve what I need here without extensive manipulation? Ideally some way to tell the grouper to use the timestamp level, or to fill missing values itself?
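For comparison, this is the single-index, frequency-string behaviour mentioned above, which is roughly what I would like within each group (a minimal sketch; single is just an illustrative frame):

# Minimal sketch on a plain DatetimeIndex: a '2D' window looks back two
# calendar days from each row, so gaps in the dates need no filler rows.
single = pd.DataFrame({"colA": [1, 1, 1]},
                      index=pd.DatetimeIndex(["2017-01-01",
                                              "2017-01-03",
                                              "2017-01-04"]))
print(single.rolling("2D").sum())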

1 Answer


You can use groupby + resample + fillna (requires pandas 0.19.0 or later):

from datetime import datetime
import pandas as pd

multi_index = pd.MultiIndex.from_tuples([
    ("A", datetime(2017, 1, 1)), 
    ("A", datetime(2017, 1, 2)), 
    ("A", datetime(2017, 1, 3)), 
    ("A", datetime(2017, 1, 4)),
    ("B", datetime(2017, 1, 1)),
    ("B", datetime(2017, 1, 3)),
    ("B", datetime(2017, 1, 4))])
df = pd.DataFrame(index=multi_index, data={"colA": [1, 2, 3, 4, 1, 2, 3]})
print(df)
              colA
A 2017-01-01     1
  2017-01-02     2
  2017-01-03     3
  2017-01-04     4
B 2017-01-01     1
  2017-01-03     2
  2017-01-04     3

b = df.groupby(level=0).resample('1D', level=1).sum().fillna(0).rolling(2).sum()
print(b)
              colA
A 2017-01-01   NaN
  2017-01-02   3.0
  2017-01-03   5.0
  2017-01-04   7.0
B 2017-01-01   5.0
  2017-01-02   1.0
  2017-01-03   2.0
  2017-01-04   5.0

4 Comments

Awesome answer, however I want the first B to be NaN (as it is a new group)... Using your code I was able to do that: `` df.groupby(level=0).resample('1D', level=1).sum().fillna(0).groupby(level=0).apply(lambda x: x.rolling(2).sum()) ``
But if I use df.groupby([df.index.get_level_values(0), pd.Grouper(freq="1D", level=-1)]).sum().unstack().fillna(0).stack().rolling(2).sum() then I get the same output.
You are right. Your answer perfectly did what I was asking. I upvoted obviously as it was very helpful (I am a noob here so my upvotes do not count apparently!) Thanks so much for your help!
I have a similar issue, just one slight modification: How can I get the rolling sum to 'reset', i.e. the first value in colA for 'B' (in frame b) should be NaN as opposed to 5? Essentially, calculate the rolling sum across the dates, for each item in level 0 of the index separately without overlap
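A possible sketch for the per-group reset asked about here, assuming pandas 0.24+ for droplevel: roll inside each level-0 group, then drop the duplicated key level that groupby.rolling prepends (filled and per_group are just illustrative names).

# Fill the daily frequency as in the answer, then roll within each group so
# the window never carries over from A into B.
filled = df.groupby(level=0).resample('1D', level=1).sum().fillna(0)
# groupby.rolling prepends the group key to the index, hence the droplevel(0).
per_group = filled.groupby(level=0).rolling(2).sum().droplevel(0)
print(per_group)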
