
Here is an example:

import numpy as np
import pandas as pd

# Generate a random time series dataframe with 'price' and 'vol'
x = pd.date_range('2017-01-01', periods=100, freq='1min')
df_x = pd.DataFrame({'price': np.random.randint(50, 100, size=x.shape), 'vol': np.random.randint(1000, 2000, size=x.shape)}, index=x)
df_x.head(10)
                     price   vol
2017-01-01 00:00:00     56  1544
2017-01-01 00:01:00     70  1680
2017-01-01 00:02:00     92  1853
2017-01-01 00:03:00     94  1039
2017-01-01 00:04:00     81  1180
2017-01-01 00:05:00     70  1443
2017-01-01 00:06:00     56  1621
2017-01-01 00:07:00     68  1093
2017-01-01 00:08:00     59  1684
2017-01-01 00:09:00     86  1591

# Here is some example aggregate function:
df_x.resample('5Min').agg({'price': 'mean', 'vol': 'sum'}).head()
                     price   vol
2017-01-01 00:00:00   78.6  7296
2017-01-01 00:05:00   67.8  7432
2017-01-01 00:10:00   76.0  9017
2017-01-01 00:15:00   74.0  6989
2017-01-01 00:20:00   64.4  8078

However, if I want to extract other aggregated info that depends on more than one column, what can I do?

For example, I want to append 2 more columns here, called all_up and all_down.

These 2 columns' calculations are defined as follows:

In every 5-minute window, count how many times the 1-minute sampled price went down and vol also went down (call this column all_down), and how many times they both went up (call this column all_up).

Here is what I expect the 2 columns look like:

                     price   vol  all_up  all_down
2017-01-01 00:00:00   78.6  7296       2         0
2017-01-01 00:05:00   67.8  7432       0         0
2017-01-01 00:10:00   76.0  9017       1         0
2017-01-01 00:15:00   74.0  6989       1         1
2017-01-01 00:20:00   64.4  8078       0         2
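
For what it's worth, here is the hand check I did on the first window (00:00–00:04) using the printed sample values above; the two Series below simply copy those numbers:

# Hand check of the first 5-minute window (values copied from df_x.head(10) above)
p = pd.Series([56, 70, 92, 94, 81])            # prices at 00:00 .. 00:04
v = pd.Series([1544, 1680, 1853, 1039, 1180])  # vols at 00:00 .. 00:04
all_up = ((p.diff() > 0) & (v.diff() > 0)).sum()    # -> 2
all_down = ((p.diff() < 0) & (v.diff() < 0)).sum()  # -> 0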

This calculation depends on 2 columns. But the agg function of the Resampler object seems to accept only 3 kinds of arguments:

  • a str or a function, applied to each of the columns separately.
  • a list of functions, each applied to each of the columns separately.
  • a dict whose keys match the column names; each value is still a function applied to a single column at a time.

None of these seem to meet my needs.
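
For reference, the three forms look roughly like this on df_x, and each still operates on one column at a time:

# The three argument forms Resampler.agg appears to accept
df_x.resample('5Min').agg('mean')                           # a single str/function
df_x.resample('5Min').agg(['mean', 'max'])                  # a list of functions
df_x.resample('5Min').agg({'price': 'mean', 'vol': 'sum'})  # a dict keyed by column name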

  • Could you give us an example of what it is you need? What is your expected output? Commented Dec 22, 2017 at 8:47
  • @cᴏʟᴅsᴘᴇᴇᴅ, sir, I just added an expected output dataframe. The all_up column counts, in every 5 minutes, how many times the 1-minute price goes up and vol also goes up; all_down is the opposite. Commented Dec 22, 2017 at 9:11
  • Okay, this is helpful, but I need a little more data. You have only given data for 10 minutes. Can you add data for 20 minutes, which gives your output? Commented Dec 22, 2017 at 9:15
  • I think your values for all_down are incorrect. Please check that again? I'm getting a different answer based on my calculations. Commented Dec 22, 2017 at 9:37
  • Thanks. Since the price and vol data are random integers, that's probably why we get different values. Actually I only manually calculated all_up and all_down for the first 10 minutes. Commented Dec 22, 2017 at 9:49

1 Answer


I think instead of resample you need groupby + Grouper and apply with a custom function:

def func(x):
    # aggregation on a single column
    a = x['price'].mean()
    # custom aggregation working with 2 columns
    b = (x['price'] / x['vol']).mean()
    return pd.Series([a, b], index=['col1', 'col2'])

df_x.groupby(pd.Grouper(freq='5Min')).apply(func)
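
With the sample frame from the question, the result is indexed by the 5-minute bins and has columns col1 and col2; the first col1 value should match the 78.6 price mean seen in the resample output above:

res = df_x.groupby(pd.Grouper(freq='5Min')).apply(func)
res.head()           # 5-minute bins as index, columns 'col1' and 'col2'
res['col1'].iloc[0]  # -> 78.6 with the sample data shown in the question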

Or use resample for all supported aggregate functions and join its output with the output of the custom function:

def func(x):
    #custom function
    b = (x['price'] / x['vol']).mean()
    return b

df1 = df_x.groupby(pd.Grouper(freq='5Min')).apply(func)
df2 = df_x.resample('5Min').agg({'price': 'mean', 'vol': 'sum'})

df = pd.concat([df1, df2], axis=1)
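
Note that func in this variant returns a scalar, so df1 comes back as an unnamed Series; giving it a name before the concat keeps the combined frame readable (price_per_vol is only an illustrative label):

df = pd.concat([df1.rename('price_per_vol'), df2], axis=1)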

EDIT: To check for decreases and increases, use diff and compare with 0, join both conditions with & and count with sum:

def func(x):
    v = x['vol'].diff().fillna(0)
    p = x['price'].diff().fillna(0)
    m1 = (v > 0) & (p > 0)  # both price and vol went up
    m2 = (v < 0) & (p < 0)  # both price and vol went down
    return pd.Series([m1.sum(), m2.sum()], index=['all_up', 'all_down'])


df1 = df_x.groupby(pd.Grouper(freq='5Min')).apply(func)
print (df1)
                     all_up  all_down
2017-01-01 00:00:00       2         0
2017-01-01 00:05:00       0         0

df2 = df_x.resample('5Min').agg({'price': 'mean', 'vol': 'sum'})
df = pd.concat([df2, df1], axis=1)
print (df)
                      vol  price  all_up  all_down
2017-01-01 00:00:00  7296   78.6       2         0
2017-01-01 00:05:00  7432   67.8       0         0

3 Comments

Thanks for bringing up the pd.Grouper function, which I didn't know about. But I tried your suggested method and it doesn't seem to work the way I expected: df_x.groupby(pd.Grouper(freq='5Min')).apply(func)
It works. Thanks. Just a follow-up: does this mean groupby + Grouper + apply is more flexible than resample + apply? Also, is there any difference between using the apply method and the agg or aggregate method?
Hmmm, I think so, but I'm not 100% sure. groupby + apply is more common, though, so in my opinion it is less buggy and better implemented.
