
The following data frame is used as input:

import pandas as pd
import numpy as np

json_string = '{"datetime":{"0":1528955662000,"1":1528959255000,"2":1528965487000,"3":1528966204000,"4":1528966289000,"5":1528971637000,"6":1528974438000,"7":1528975251000,"8":1528982200000,"9":1528992569000,"10":1528994282000},"hit":{"0":1,"1":0,"2":0,"3":0,"4":0,"5":1,"6":1,"7":0,"8":1,"9":0,"10":1}}'
df = pd.read_json(json_string)

The exercise is to compute, for each moment in time (datetime), the mean of the hit column over all earlier observations; the current observation must not be included in the mean. For instance, the first observation (index=0) gets np.nan, since no earlier observations exist. The second observation (index=1) gets 1, since 1/1 = 1 (its own hit value of 0 is excluded). The third observation (index=2) gets 0.5, since (1+0)/2 = 0.5.
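To pin down the expected numbers independently of pandas, here is a minimal pure-Python sketch of the same arithmetic. The helper name prior_mean is made up for illustration, and the sketch assumes the hit values are already ordered by datetime (which is the case for the sample above).

```python
import math


def prior_mean(hits):
    """For each position, return the mean of all earlier values (NaN if there are none)."""
    means = []
    running_sum = 0
    for count, value in enumerate(hits):
        # `count` is the number of earlier observations; the current value
        # is added to the running sum only after its own mean is recorded.
        means.append(running_sum / count if count else math.nan)
        running_sum += value
    return means


# The hit column from the sample JSON, in datetime order.
hits = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1]
expected = prior_mean(hits)
```

Any pandas-based solution should reproduce these values, with NaN in the first row.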

My code produces the correct numbers but is not elegant. Is there a cleaner way to complete the exercise? For example, is it possible to use pandas.api.indexers.VariableOffsetWindowIndexer, or to subclass pandas.api.indexers.BaseIndexer and implement its get_window_bounds() method?

My solution:

def add_hr(df):
    """
    Generate a feature `mean_hr` which represents the average hit rate
    at the moment of making the offer (`datetime`).

    Parameters
    ----------
    df : pandas.DataFrame
        The `hit` column must be present. Ascending/descending order in the `datetime`
        column is not assumed.

        hit : int
        datetime : string (format='%Y-%m-%d %H:%M:%S')

    Returns
    -------
    df_expanded : pandas.DataFrame
        A (deep) copy of the input pandas.DataFrame.
    """

    df_expanded = df.copy(deep=True)

    df_expanded.sort_values(by=['datetime'], ascending=True, inplace=True)

    df_expanded['mean_hr'] = df_expanded['hit'].expanding().mean()

    srs = df_expanded['mean_hr']

    srs = srs[:len(srs)-1]
    srs = pd.concat([pd.Series([np.nan]), srs])
    df_expanded['mean_hr'] = srs.tolist()

    return df_expanded

Full disclosure: The exercise was part of a recruitment process a month ago. The recruitment is now closed and I can no longer submit code.

2 Answers


A simpler way to achieve this is to shift the expanding mean of the hit column down by one row, so that each row only sees earlier observations:

df.sort_values(by=['datetime'], inplace=True)
df['mean_hit'] = df['hit'].expanding().mean().shift(1)
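As a quick check of the shift approach, here is a small self-contained sketch. It uses the first five rows of the question's sample, deliberately shuffled, and restricts the expanding mean to the hit column. The variable name out and the column name mean_hr are only illustrative choices.

```python
import pandas as pd

# First five rows of the question's sample, deliberately out of order.
df = pd.DataFrame({
    "datetime": pd.to_datetime(
        [1528965487000, 1528955662000, 1528966289000, 1528959255000, 1528966204000],
        unit="ms",
    ),
    "hit": [0, 1, 0, 0, 0],
})

# Sort by time without mutating the input, then shift the expanding mean
# down one row so that each row only reflects strictly earlier observations.
out = df.sort_values("datetime").reset_index(drop=True)
out["mean_hr"] = out["hit"].expanding().mean().shift(1)
```

The first row comes out as NaN because shift(1) leaves nothing to place there, which matches the requirement that an observation with no predecessors has no mean.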


It seems that the problem can be solved by subclassing the BaseIndexer class:

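import numpy as np
from pandas.api.indexers import BaseIndexer

class CustomIndexer(BaseIndexer):

    def get_window_bounds(self, num_values, min_periods, center, closed, step):
        # The window for row i is [0, i): every earlier row, excluding row i itself.
        end = np.arange(0, num_values, step, dtype='int64')
        start = np.zeros(len(end), dtype='int64')

        return start, end

indexer = CustomIndexer(window_size=0)

df_expanded = df.copy(deep=True)
# Sort first so that "earlier rows" means "earlier in time".
df_expanded.sort_values(by=['datetime'], inplace=True)

df_expanded.hit = df_expanded.hit.rolling(indexer).mean()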
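The following sketch checks that such an indexer agrees with the expanding-mean-plus-shift approach on the question's hit values. The class name PriorRowsIndexer is hypothetical. One assumption to flag: as far as I know, pandas compares the parameter names of get_window_bounds with those of its own BaseIndexer, and the step parameter was added in pandas 1.5, so this signature assumes pandas 1.5 or later; on older versions the step argument would have to be dropped.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class PriorRowsIndexer(BaseIndexer):
    """Window for row i covers rows [0, i), i.e. all rows before row i."""

    # Signature assumed to match BaseIndexer in pandas >= 1.5 (includes `step`).
    def get_window_bounds(self, num_values=0, min_periods=None, center=None,
                          closed=None, step=None):
        # Window ends are exclusive, so end == i leaves row i out of its own window.
        end = np.arange(0, num_values, step, dtype="int64")
        start = np.zeros(len(end), dtype="int64")
        return start, end


# The hit column from the question's sample, already in datetime order.
hit = pd.Series([1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1], name="hit")

via_indexer = hit.rolling(PriorRowsIndexer(window_size=0), min_periods=1).mean()
via_shift = hit.expanding().mean().shift(1)
```

Both series start with NaN, since the first window is empty, and should match row for row afterwards. Passing min_periods=1 here is a choice made for explicitness rather than something the original answer does.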

3 Comments

Somehow I get this error with your answer: ValueError: CustomIndexer does not implement the correct signature for get_window_bounds
For this to work, the JSON sample from the question must be loaded without datetime parsing, via df = pd.read_json(json_string, convert_dates=False); otherwise the line with rolling raises an exception: DataError: Cannot aggregate non-numeric type: datetime64[ns]
A better alternative is to restrict the rolling mean just to the hit column via: df_expanded.hit = df_expanded.hit.rolling(indexer).mean()
