Pandas rolling apply using multiple columns

Question

I am trying to use pandas.DataFrame.rolling.apply() on multiple columns. Python version is 3.7, pandas is 1.0.2.

import pandas as pd

# function to calculate
def masscenter(x):
    print(x)  # for debug purposes
    return 0

#simple DF creation routine
df = pd.DataFrame( [['02:59:47.000282', 87.60, 739],
                    ['03:00:01.042391', 87.51, 10],
                    ['03:00:01.630182', 87.51, 10],
                    ['03:00:01.635150', 88.00, 792],
                    ['03:00:01.914104', 88.00, 10]], 
                   columns=['stamp', 'price','nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)

'stamp' is monotonic and unique, 'price' is double and contains no NaNs, 'nQty' is integer and also contains no NaNs.

So, I need to calculate a rolling 'center of mass', i.e. sum(price*nQty)/sum(nQty) over each window.
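
As a quick check of the formula, the center of mass of the first two rows of the sample data can be computed by hand. This is only an illustrative sketch; the variable names are made up for the example and are not part of the original code.

```python
# Worked example for a window made of the first two sample rows.
prices = [87.60, 87.51]
quantities = [739, 10]

# Numerator: sum(price * nQty); denominator: sum(nQty).
weighted_sum = sum(p * q for p, q in zip(prices, quantities))
total_qty = sum(quantities)

center = weighted_sum / total_qty
print(round(center, 4))  # roughly 87.5988
```

The large quantity on the first row pulls the result close to 87.60, which is the behaviour a quantity-weighted average should have.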

What I tried so far:

df.apply(masscenter, axis = 1)

masscenter is called 5 times, each time with a single row, and the printed output looks like

price     87.6
nQty     739.0
Name: 1900-01-01 02:59:47.000282, dtype: float64

This is the desired input to masscenter, because I can easily access price and nQty using x[0], x[1]. However, I am stuck with rolling.apply(). Reading the docs for DataFrame.rolling() and rolling.apply(), I supposed that using 'axis' in rolling() and 'raw' in apply() would achieve similar behaviour. A naive approach

rol = df.rolling(window=2)
rol.apply(masscenter)

prints row by row (increasing number of rows up to window size)

stamp
1900-01-01 02:59:47.000282    87.60
1900-01-01 03:00:01.042391    87.51
dtype: float64

then

stamp
1900-01-01 02:59:47.000282    739.0
1900-01-01 03:00:01.042391     10.0
dtype: float64

So, the columns are passed to masscenter separately (as expected).

Sadly, the docs contain barely any info about 'axis'. The obvious next variant was

rol = df.rolling(window=2, axis = 1)
rol.apply(masscenter)

This never calls masscenter and raises a ValueError in rol.apply(..)

> Length of passed values is 1, index implies 5

I admit that I'm not sure about the 'axis' parameter and how it works, due to the lack of documentation. This is the first part of the question: What is going on here? How do I use 'axis' properly? What is it designed for?

Of course, there were answers previously, namely:

How-to-apply-a-function-to-two-columns-of-pandas-dataframe
It works for the whole DataFrame, not Rolling.

How-to-invoke-pandas-rolling-apply-with-parameters-from-multiple-column
The answer suggests writing my own roll function, but the sticking point for me is the same as asked in the comments: what if one needs to use an offset window size (e.g. '1T') for non-uniform timestamps?
I don't like the idea of reinventing the wheel from scratch. Also, I'd like to use pandas for everything to prevent inconsistency between sets obtained from pandas and a 'self-made roll'. There is another answer to that question, suggesting populating the dataframe separately and calculating whatever I need, but it will not work: the size of the stored data would be enormous. The same idea is presented here:
Apply-rolling-function-on-pandas-dataframe-with-multiple-arguments

Another Q & A posted here
Pandas-using-rolling-on-multiple-columns
It is good and the closest to my problem, but again, there is no possibility to use offset window sizes (window = '1T').

Some of the answers were written before pandas 1.0 came out, and given that the docs could be much better, I hope it is possible to roll over multiple columns simultaneously now.

The second part of the question is: Is there any possibility to roll over multiple columns simultaneously using pandas 1.0.x with offset window size?

@HighGPA the 'masscenter' function was constructed this way to create a minimal, reproducible example. Have you declared columns=['stamp', 'price','nQty']?
– Suthiro, Commented Aug 13, 2022 at 14:55

Asclepius · Accepted Answer · 2023-09-17 03:31:52Z

47

How about this:

import pandas as pd

def masscenter(ser: pd.Series, df: pd.DataFrame):
    df_roll = df.loc[ser.index]
    return your_actual_masscenter(df_roll)

masscenter_output = df['price'].rolling(window=3).apply(masscenter, args=(df,))

It uses the rolling logic to get subsets via an arbitrary column. The arbitrary column itself is not used, only the rolling index is used. This relies on the default of raw=False which provides the index values for those subsets. The applied function uses those index values to get multi-column slices from the original dataframe.
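
For illustration, here is a self-contained sketch of the same idea applied to the question's sample data with an offset window. It is not the answerer's code: the body of masscenter substitutes the center-of-mass formula for your_actual_masscenter, the '1s' window is an arbitrary choice, and the approach assumes the index is unique so that df.loc[ser.index] selects exactly the rows of the current window.

```python
import pandas as pd

# Sample data from the question, indexed by timestamp.
df = pd.DataFrame(
    {"price": [87.60, 87.51, 87.51, 88.00, 88.00],
     "nQty": [739, 10, 10, 792, 10]},
    index=pd.to_datetime(
        ["02:59:47.000282", "03:00:01.042391", "03:00:01.630182",
         "03:00:01.635150", "03:00:01.914104"],
        format="%H:%M:%S.%f",
    ),
)
df.index.name = "stamp"


def masscenter(ser: pd.Series, frame: pd.DataFrame) -> float:
    # ser carries the index labels of the current window (needs raw=False);
    # use them to pull the matching multi-column slice from the full frame.
    window = frame.loc[ser.index]
    return (window["price"] * window["nQty"]).sum() / window["nQty"].sum()


# Offset window: every row within the last second of the current timestamp.
result = df["price"].rolling("1s").apply(masscenter, args=(df,), raw=False)
print(result)
```

Because the function is called once per window in Python and performs a label lookup each time, this is convenient rather than fast, which matches the performance remark in the comments.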

edited Sep 17, 2023 at 3:31

Asclepius

answered Mar 29, 2020 at 17:27

adr

15 Comments

Suthiro Over a year ago

This is the answer at least for the second part. I knew that it should be possible. Such simple and pythonic solution! Thank you very much!

Harald Husum Over a year ago

This is a very expensive solution, though, for larger datasets.

adr Over a year ago

Could you elaborate on your observation?

falsePockets Over a year ago

"then you use those index values to get multi-column slices from your original DataFrame" - do you mean .iloc[i, c] to reach into the dataframe from inside masscenter? i.e. masscenter doesn't have the arguments it needs directly?

w00dy Over a year ago

Works for me though pandas' devs should really implement this somehow...

saninstein · Accepted Answer · 2020-03-18 16:11:32Z

20

You can use the rolling_apply function from the numpy_ext module:

import numpy as np
import pandas as pd
from numpy_ext import rolling_apply


def masscenter(price, nQty):
    return np.sum(price * nQty) / np.sum(nQty)


df = pd.DataFrame( [['02:59:47.000282', 87.60, 739],
                    ['03:00:01.042391', 87.51, 10],
                    ['03:00:01.630182', 87.51, 10],
                    ['03:00:01.635150', 88.00, 792],
                    ['03:00:01.914104', 88.00, 10]], 
                   columns=['stamp', 'price','nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)

window = 2
df['y'] = rolling_apply(masscenter, window, df.price.values, df.nQty.values)
print(df)

                            price  nQty          y
stamp                                             
1900-01-01 02:59:47.000282  87.60   739        NaN
1900-01-01 03:00:01.042391  87.51    10  87.598798
1900-01-01 03:00:01.630182  87.51    10  87.510000
1900-01-01 03:00:01.635150  88.00   792  87.993890
1900-01-01 03:00:01.914104  88.00    10  88.000000

answered Mar 18, 2020 at 16:11

saninstein

2 Comments

Suthiro Over a year ago

Thank you, but alas! It also accepts only a fixed window size, i.e. 2 (or whatever number of) points, but not seconds or other so-called offsets. However, you gave me an idea. I will try it and, if it works, will post soon.

shaunc Over a year ago

If your data is not too sparse, you could use rolling_apply as suggested using a window large enough to encompass given time offset records, and incorporate a bounds check against the stamp inside the applied function. You might need to use large windows, but potentially rolling_apply's ability to execute parallel jobs might make up for that.

Hamid Fadishei · Accepted Answer · 2022-08-21 17:03:09Z

7

To perform a rolling window operation with access to all columns of a dataframe, you can pass method='table' to rolling(). Example:

import pandas as pd
import numpy as np
from numba import jit

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [1, 3, 5, 7, 9, 11]})

@jit
def f(w):
    # we have access to both columns of the dataframe here
    return np.max(w), np.min(w)

df.rolling(3, method='table').apply(f, raw=True, engine='numba')

Note that method='table' requires the numba engine (pip install numba). The @jit part in the example is not mandatory but helps with performance. The result of the above example code will be:

   a    b
 NaN  NaN
 NaN  NaN
 5.0  1.0
 7.0  2.0
 9.0  3.0
11.0  4.0

answered Aug 21, 2022 at 17:03

Hamid Fadishei

Comments

Contango · Accepted Answer · 2021-05-27 14:08:46Z

6

With reference to the excellent answer from @saninstein.

Install numpy_ext from: https://pypi.org/project/numpy-ext/

import numpy as np
import pandas as pd
from numpy_ext import rolling_apply as rolling_apply_ext

def box_sum(a,b):
    return np.sum(a) + np.sum(b)

df = pd.DataFrame({"x": [1,2,3,4], "y": [1,2,3,4]})

window = 2
df["sum"] = rolling_apply_ext(box_sum, window , df.x.values, df.y.values)

Output:

print(df.to_string(index=False))
 x  y  sum
 1  1  NaN
 2  2  6.0
 3  3 10.0
 4  4 14.0

Notes

The rolling function is timeseries friendly. It defaults to always looking backwards, so the 6 is the sum of present and past values in the array.
In the sample above, I imported rolling_apply as rolling_apply_ext so it cannot possibly interfere with any existing calls to Pandas rolling_apply (thanks to the comment by @LudoSchmidt).

As a side note, I gave up trying to use Pandas. It's fundamentally broken: it handles single-column aggregate and apply with few problems, but it's an overly complex Rube Goldberg machine when trying to get it to work with two or more columns.

edited May 27, 2021 at 14:08

answered Apr 24, 2021 at 16:30

Contango

4 Comments

Ludo Schmidt Over a year ago

Personal experience with your response: I ran pip install and used "from numpy_ext import rolling_apply", but it broke pandas in my script.

Contango Over a year ago

@LudoSchmidt Good point. Updated code above to import rolling_apply as rolling_apply_ext, so everything is backwards compatible with the existing rolling_apply calls in Pandas.

Suthiro Over a year ago

"I gave up trying to use Pandas. It's fundamentally broken" - this, unfortunately.

Tom Over a year ago

how would you do this on a GroupBy?

Suthiro · Accepted Answer · 2020-03-24 16:11:49Z

I found no way to roll over two columns using built-in pandas functions, so I wrote my own routine. The code is listed below.

import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

# function to find an index corresponding
# to current value minus offset value
def prevInd(series, offset, date):
    offset = to_offset(offset)
    end_date = date - offset
    end = series.index.searchsorted(end_date, side="left")
    return end

# function to find an index corresponding
# to the first value greater than current
# it is useful when one has timeseries with non-unique
# but monotonically increasing values
def nextInd(series, date):
    end = series.index.searchsorted(date, side="right")
    return end

def twoColumnsRoll(dFrame, offset, usecols, fn, columnName = 'twoColRol'):
    # find all unique indices
    uniqueIndices = dFrame.index.unique()
    numOfPoints = len(uniqueIndices)
    # prepare an output array
    moving = np.zeros(numOfPoints)
    # nameholders
    price = dFrame[usecols[0]]
    qty   = dFrame[usecols[1]]

    # iterate over unique indices
    for ii in range(numOfPoints):
        # nameholder
        pp = uniqueIndices[ii]
        # right index - position of the first value greater than current
        rInd = nextInd(dFrame, pp)
        # left index - position of the least value that
        # is greater than or equal to (pp - offset)
        lInd = prevInd(dFrame, offset, pp)
        # call the actual calculating function over two positional slices
        moving[ii] = fn(price.iloc[lInd:rInd], qty.iloc[lInd:rInd])
    # construct and return DataFrame
    return pd.DataFrame(data=moving,index=uniqueIndices,columns=[columnName])

This code works, but it is relatively slow and inefficient. I suppose one can use numpy.lib.stride_tricks from How to invoke pandas.rolling.apply with parameters from multiple column? to speed things up. However, go big or go home: I ended up writing a function in C++ and a wrapper for it.
I would rather not post this as an answer, since it is a workaround and I have answered neither part of my question, but it is too long for a comment.
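
The numpy.lib.stride_tricks idea mentioned above could look roughly like the sketch below. It is not the code from the linked question: it assumes NumPy 1.20 or later for sliding_window_view, and it only covers fixed-size windows (a number of rows), not time offsets.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Sample columns from the question as plain arrays.
price = np.array([87.60, 87.51, 87.51, 88.00, 88.00])
qty = np.array([739, 10, 10, 792, 10], dtype=float)

window = 2  # fixed number of rows per window

# Each row of these 2-D views is one window; no data is copied.
price_win = sliding_window_view(price, window)
qty_win = sliding_window_view(qty, window)

# Center of mass per window, computed for all windows at once.
center = (price_win * qty_win).sum(axis=1) / qty_win.sum(axis=1)
print(center)  # one value per full window, so len(price) - window + 1 values
```

The result has no leading NaN entries, so it needs to be aligned with the last row of each window before being written back into the DataFrame.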

manbui · Accepted Answer · 2024-01-24 08:38:31Z

1

(df['price'] * df['nQty']).rolling(2).sum() / df['nQty'].rolling(2).sum()

# output
stamp
1900-01-01 02:59:47.000282          NaN
1900-01-01 03:00:01.042391    87.598798
1900-01-01 03:00:01.630182    87.510000
1900-01-01 03:00:01.635150    87.993890
1900-01-01 03:00:01.914104    88.000000
dtype: float64

You can compute a rolling sum of price*nQty and a rolling sum of nQty, then divide the first by the second to get the weighted mean. The same solution can be used with an offset window size.
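
A minimal sketch of the offset-window variant, using the question's sample data. The '1s' window is an arbitrary choice for illustration; any offset string accepted by rolling() on a datetime index should work the same way.

```python
import pandas as pd

# Sample data from the question, indexed by timestamp.
df = pd.DataFrame(
    {"price": [87.60, 87.51, 87.51, 88.00, 88.00],
     "nQty": [739, 10, 10, 792, 10]},
    index=pd.to_datetime(
        ["02:59:47.000282", "03:00:01.042391", "03:00:01.630182",
         "03:00:01.635150", "03:00:01.914104"],
        format="%H:%M:%S.%f",
    ),
)

window = "1s"  # time-based window instead of a fixed row count

# Two built-in rolling sums over the same window, then an element-wise ratio.
weighted = (df["price"] * df["nQty"]).rolling(window).sum()
total = df["nQty"].rolling(window).sum()
center = weighted / total
print(center)
```

With an offset window the default min_periods is 1, so the first row already produces a value instead of NaN.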

edited Jan 24, 2024 at 8:38

answered Jan 24, 2024 at 8:36

manbui

1 Comment

Suthiro Over a year ago

This works, but one needs to compute rolling more than once, so the solution is (very) slow for large datasets.

Anibal Yeh · Accepted Answer · 2022-06-30 09:21:04Z

0

How about this?

import numpy as np
import pandas as pd

ggg = pd.DataFrame({"a":[1,2,3,4,5,6,7], "b":[7,6,5,4,3,2,1]})

def my_rolling_apply2(df, fun, window):
    prepend = [None] * (window - 1)
    end = len(df) - window
    mid = map(lambda start: fun(df[start:start + window]), np.arange(0,end))
    last =  fun(df[end:])
    return [*prepend, *mid, last]

my_rolling_apply2(ggg, lambda df: (df["a"].max(), df["b"].min()), 3)

And result is:

[None, None, (3, 5), (4, 4), (5, 3), (6, 2), (7, 1)]

answered Jun 30, 2022 at 9:21

Anibal Yeh
