
Here is an example:

import numpy as np
import pandas as pd

# Generate a random time series dataframe with 'price' and 'vol'
x = pd.date_range('2017-01-01', periods=100, freq='1min')
df_x = pd.DataFrame({'price': np.random.randint(50, 100, size=x.shape), 'vol': np.random.randint(1000, 2000, size=x.shape)}, index=x)
df_x.head(10)
                     price   vol
2017-01-01 00:00:00     56  1544
2017-01-01 00:01:00     70  1680
2017-01-01 00:02:00     92  1853
2017-01-01 00:03:00     94  1039
2017-01-01 00:04:00     81  1180
2017-01-01 00:05:00     70  1443
2017-01-01 00:06:00     56  1621
2017-01-01 00:07:00     68  1093
2017-01-01 00:08:00     59  1684
2017-01-01 00:09:00     86  1591

# Here is some example aggregate function:
df_x.resample('5Min').agg({'price': 'mean', 'vol': 'sum'}).head()
                     price   vol
2017-01-01 00:00:00   78.6  7296
2017-01-01 00:05:00   67.8  7432
2017-01-01 00:10:00   76.0  9017
2017-01-01 00:15:00   74.0  6989
2017-01-01 00:20:00   64.4  8078

However, if I want to extract other aggregated info that depends on more than one column, what can I do?

For example, I want to append 2 more columns here, called all_up and all_down.

These 2 columns' calculations are defined as follows:

In every 5-minute window, count how many times the 1-minute sampled price went down and vol also went down (call this column all_down), and how many times they both went up (call this column all_up).

Here is what I expect the 2 columns look like:

                     price   vol  all_up  all_down
2017-01-01 00:00:00   78.6  7296       2         0
2017-01-01 00:05:00   67.8  7432       0         0
2017-01-01 00:10:00   76.0  9017       1         0
2017-01-01 00:15:00   74.0  6989       1         1
2017-01-01 00:20:00   64.4  8078       0         2
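
For what it's worth, here is the hand check I did on the first window (00:00–00:04) using the printed sample values above; the two Series below simply copy those numbers:

# Hand check of the first 5-minute window (values copied from df_x.head(10) above)
p = pd.Series([56, 70, 92, 94, 81])            # prices at 00:00 .. 00:04
v = pd.Series([1544, 1680, 1853, 1039, 1180])  # vols at 00:00 .. 00:04
all_up = ((p.diff() > 0) & (v.diff() > 0)).sum()    # -> 2
all_down = ((p.diff() < 0) & (v.diff() < 0)).sum()  # -> 0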

This calculation depends on 2 columns. But the agg function of the Resampler object seems to accept only 3 kinds of arguments:

  • a str or a function, applied to each of the columns separately.
  • a list of functions, each applied to each of the columns separately.
  • a dict whose keys match the column names; each value is still a function applied to a single column at a time.

None of these seem to meet my needs.
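
For reference, the three forms look roughly like this on df_x, and each still operates on one column at a time:

# The three argument forms Resampler.agg appears to accept
df_x.resample('5Min').agg('mean')                           # a single str/function
df_x.resample('5Min').agg(['mean', 'max'])                  # a list of functions
df_x.resample('5Min').agg({'price': 'mean', 'vol': 'sum'})  # a dict keyed by column name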

  • Could you give us an example of what it is you need? What is your expected output? Commented Dec 22, 2017 at 8:47
  • @cᴏʟᴅsᴘᴇᴇᴅ, sir, I just added an expected output dataframe. The all_up column counts, in every 5 minutes, how many times the 1-minute price goes up and vol also goes up; all_down is the opposite. Commented Dec 22, 2017 at 9:11
  • Okay, this is helpful, but I need a little more data. You have only given data for 10 minutes. Can you add data for 20 minutes, which gives your output? Commented Dec 22, 2017 at 9:15
  • I think your values for all_down are incorrect. Please check that again? I'm getting a different answer based on my calculations. Commented Dec 22, 2017 at 9:37
  • Thanks. Since the price and vol data are random integers, that's probably why we get different values. Actually I only manually calculated all_up and all_down for the first 10 minutes. Commented Dec 22, 2017 at 9:49

1 Answer


I think instead of resample you need groupby + Grouper and apply with a custom function:

def func(x):
    # aggregation on a single column
    a = x['price'].mean()
    # custom aggregation working with 2 columns
    b = (x['price'] / x['vol']).mean()
    return pd.Series([a, b], index=['col1', 'col2'])

df_x.groupby(pd.Grouper(freq='5Min')).apply(func)
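
With the sample frame from the question, the result is indexed by the 5-minute bins and has columns col1 and col2; the first col1 value should match the 78.6 price mean seen in the resample output above:

res = df_x.groupby(pd.Grouper(freq='5Min')).apply(func)
res.head()           # 5-minute bins as index, columns 'col1' and 'col2'
res['col1'].iloc[0]  # -> 78.6 with the sample data shown in the question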

Or use resample for all supported aggregate functions and join its output with the output of the custom function:

def func(x):
    #custom function
    b = (x['price'] / x['vol']).mean()
    return b

df1 = df_x.groupby(pd.Grouper(freq='5Min')).apply(func)
df2 = df_x.resample('5Min').agg({'price': 'mean', 'vol': 'sum'})

df = pd.concat([df1, df2], axis=1)
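
Note that func in this variant returns a scalar, so df1 comes back as an unnamed Series; giving it a name before the concat keeps the combined frame readable (price_per_vol is only an illustrative label):

df = pd.concat([df1.rename('price_per_vol'), df2], axis=1)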

EDIT: To check for decreases and increases, use diff and compare with 0, join both conditions with & and count with sum:

def func(x):
    v = x['vol'].diff().fillna(0)
    p = x['price'].diff().fillna(0)
    m1 = (v > 0) & (p > 0)  # both price and vol went up
    m2 = (v < 0) & (p < 0)  # both price and vol went down
    return pd.Series([m1.sum(), m2.sum()], index=['all_up', 'all_down'])


df1 = df_x.groupby(pd.Grouper(freq='5Min')).apply(func)
print (df1)
                     all_up  all_down
2017-01-01 00:00:00       2         0
2017-01-01 00:05:00       0         0

df2 = df_x.resample('5Min').agg({'price': 'mean', 'vol': 'sum'})
df = pd.concat([df2, df1], axis=1)
print (df)
                      vol  price  all_up  all_down
2017-01-01 00:00:00  7296   78.6       2         0
2017-01-01 00:05:00  7432   67.8       0         0

3 Comments

Thanks for bringing up the pd.Grouper function, which I didn't know about. But I tried your suggested method and it doesn't seem to work the way I expected: df_x.groupby(pd.Grouper(freq='5Min')).apply(func)
It works. Thanks. Just a follow-up: does this mean groupby + Grouper + apply is more flexible than resample + apply? Also, is there any difference between using the apply method and the agg or aggregate method?
Hmmm, I think so, but I'm not 100% sure. groupby + apply is more common, though, so in my opinion it is less buggy and better implemented.
