Pandas - aggregate values with a variable-length rolling window

Question

The following data frame is used as input:

import pandas as pd
import numpy as np

json_string = '{"datetime":{"0":1528955662000,"1":1528959255000,"2":1528965487000,"3":1528966204000,"4":1528966289000,"5":1528971637000,"6":1528974438000,"7":1528975251000,"8":1528982200000,"9":1528992569000,"10":1528994282000},"hit":{"0":1,"1":0,"2":0,"3":0,"4":0,"5":1,"6":1,"7":0,"8":1,"9":0,"10":1}}'
df = pd.read_json(json_string)

The exercise requires you to compute the mean of the hit column for each moment in time (datetime), excluding the current observation from the mean. For instance, the first observation (index=0) gets NaN (np.nan), since there are no observations apart from the one we're calculating the mean for. The second observation (index=1) gets 1, since 1/1 = 1 (the 0 from the second observation is not included). The third observation (index=2) gets 0.5, since (1+0)/2 = 0.5.
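As a cross-check of the expected numbers, here is a minimal reference sketch in plain Python. The helper name `prior_mean` is made up for illustration, and the list simply repeats the hit values from the JSON above, which are already in ascending datetime order.

```python
import math

# Values of the `hit` column from the question, in ascending datetime order.
hits = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1]

def prior_mean(values):
    """Mean of all earlier values for each position; NaN when there are none."""
    result = []
    running_sum = 0
    for i, value in enumerate(values):
        # i earlier values have been seen so far; the current one is excluded.
        result.append(running_sum / i if i > 0 else math.nan)
        running_sum += value  # the current value only counts for later rows
    return result

expected = prior_mean(hits)
print(expected[:3])  # [nan, 1.0, 0.5]
```

Any pandas-based solution should reproduce these values once the frame is sorted by datetime.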

My code produces the correct numbers but is not elegant. Can the exercise be solved differently? For example, is it possible to use pandas.api.indexers.VariableOffsetWindowIndexer, or to subclass pandas.api.indexers.BaseIndexer and implement its get_window_bounds() method?

My solution:

def add_hr(df):
    """
    Generate a feature `mean_hr` which represents the average hit rate
    at the moment of making the offer (`datetime`).

    Parameters
    ----------
    df : pandas.DataFrame
        The `hit` column must be present. Ascending/descending order in the `datetime`
        column is not assumed.

        hit : int
        datetime : string (format='%Y-%m-%d %H:%M:%S')

    Returns
    -------
    df_expanded : pandas.DataFrame
        A (deep) copy of the input pandas.DataFrame.
    """

    df_expanded = df.copy(deep=True)

    df_expanded.sort_values(by=['datetime'], ascending=True, inplace=True)

    df_expanded['mean_hr'] = df_expanded['hit'].expanding().mean()

    srs = df_expanded['mean_hr']

    srs = srs[:len(srs)-1]
    srs = pd.concat([pd.Series([np.nan]), srs])
    df_expanded['mean_hr'] = srs.tolist()

    return df_expanded

Full disclaimer: The exercise was a part of a recruitment process a month ago. The recruitment is now closed and I can't submit code anymore.

Phik · Accepted Answer · 2020-12-12 17:46:22Z

4

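A simpler way to achieve this is to shift the expanding mean by one row, so that each row only sees the observations before it:

df.sort_values(by=['datetime'], inplace=True)
# Restrict the calculation to the hit column; the datetime column is not numeric.
df['mean_hit'] = df['hit'].expanding().mean().shift(1)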
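For a self-contained check of the shift approach, the sketch below rebuilds the question's sample data directly rather than through read_json (that construction is my own shortcut, not part of the answer) and applies the expanding mean to the hit column only.

```python
import numpy as np
import pandas as pd

# Sample data from the question: epoch milliseconds and the hit flags.
df = pd.DataFrame({
    "datetime": pd.to_datetime(
        [1528955662000, 1528959255000, 1528965487000, 1528966204000,
         1528966289000, 1528971637000, 1528974438000, 1528975251000,
         1528982200000, 1528992569000, 1528994282000],
        unit="ms",
    ),
    "hit": [1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
})

df = df.sort_values("datetime")

# Expanding mean includes the current row; shifting by one row excludes it.
df["mean_hit"] = df["hit"].expanding().mean().shift(1)

print(df[["hit", "mean_hit"]])
```

The first row comes out as NaN because shift(1) has nothing to move into it, which matches the requirement in the question.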


yucer · Accepted Answer · 2024-01-08 18:32:08Z

2

It seems that the problem can be solved by subclassing the BaseIndexer class:

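import numpy as np
from pandas.api.indexers import BaseIndexer

class CustomIndexer(BaseIndexer):

    def get_window_bounds(self, num_values, min_periods, center, closed, step):
        # Window i covers rows [0, i): every earlier row, excluding row i itself.
        end = np.arange(0, num_values, step, dtype='int64')
        start = np.zeros(len(end), dtype='int64')

        return start, end

# window_size is not used by get_window_bounds above.
indexer = CustomIndexer(window_size=0)

df_expanded = df.copy(deep=True)

# Note: this overwrites the hit column with the rolling mean.
df_expanded.hit = df_expanded.hit.rolling(indexer).mean()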
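To see whether the custom indexer agrees with the expanding-and-shift approach, the following sketch compares the two on the question's hit values. It assumes pandas 1.5 or later (where get_window_bounds receives a step argument); the class name `PriorRowsIndexer` is hypothetical and used only for this illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class PriorRowsIndexer(BaseIndexer):  # hypothetical name for this sketch
    def get_window_bounds(self, num_values, min_periods, center, closed, step):
        # Row i aggregates rows [0, i), so the current row is left out.
        end = np.arange(0, num_values, step, dtype="int64")
        start = np.zeros(len(end), dtype="int64")
        return start, end

# hit values from the question, in ascending datetime order
hit = pd.Series([1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1], dtype="float64")

via_indexer = hit.rolling(PriorRowsIndexer(window_size=0)).mean()
via_shift = hit.expanding().mean().shift(1)

# equal_nan=True treats the leading NaN in both results as matching.
same = np.allclose(via_indexer, via_shift, equal_nan=True)
print(same)
```

Both variants need the data sorted by datetime first; the indexer only works on row positions and knows nothing about the timestamps.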

answered Sep 6, 2020 at 13:25 by balkon16 · edited Jan 8, 2024 at 18:32 by yucer

3 Comments

yucer Over a year ago

Somehow I get this error with your answer: ValueError: CustomIndexer does not implement the correct signature for get_window_bounds

yucer Over a year ago

For this to work, the JSON sample from the question must be loaded without datetime parsing, via df = pd.read_json(json_string, convert_dates=False). Otherwise the line with rolling raises an exception: DataError: Cannot aggregate non-numeric type: datetime64[ns]
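A small sketch of the convert_dates=False suggestion, using only the first two rows of the question's JSON. Wrapping the string in StringIO is my own addition, since recent pandas versions deprecate passing a literal JSON string to read_json.

```python
from io import StringIO

import pandas as pd

# First two rows of the question's sample, shortened for illustration.
json_string = '{"datetime":{"0":1528955662000,"1":1528959255000},"hit":{"0":1,"1":0}}'

# With convert_dates=False the datetime column stays as integer epoch milliseconds.
raw = pd.read_json(StringIO(json_string), convert_dates=False)
print(raw.dtypes)

# The timestamps can still be converted explicitly when they are needed.
parsed = pd.to_datetime(raw["datetime"], unit="ms")
print(parsed)
```

Keeping the column numeric avoids the aggregation error when rolling over the whole frame, although restricting the rolling mean to the hit column sidesteps the issue entirely.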

yucer Over a year ago

A better alternative is to restrict the rolling mean just to the hit column via: df_expanded.hit = df_expanded.hit.rolling(indexer).mean()
