pandas: count rows within time moving window

Question

import pandas as pd
d = [{'col1' : ' B', 'col2' : '2015-3-06 01:37:57'},
       {'col1' : ' A', 'col2' : '2015-3-06 01:39:57'},
       {'col1' : ' A', 'col2' : '2015-3-06 01:45:28'},
       {'col1' : ' B', 'col2' : '2015-3-06 02:31:44'},
       {'col1' : ' B', 'col2' : '2015-3-06 03:55:45'},
       {'col1' : ' B', 'col2' : '2015-3-06 04:01:40'}]
df = pd.DataFrame(d)
df['col2'] = pd.to_datetime(df['col2'])

For each row, I want to count the rows that have the same 'col1' value and a time within the 10 minutes up to and including that row's time. I'm interested in an implementation that runs fast.

This code is very slow on a large dataset:

dt = pd.Timedelta(10, unit='m')
def count1(row):
    id1 = row['col1']
    start_time = row['col2'] - dt
    end_time = row['col2']
    mask = (df['col1'] == id1) & ((df['col2'] >= start_time) & (df['col2'] <= end_time))
    return df.loc[mask].shape[0]

df['count1'] = df.apply(count1, axis=1)

df.head(6)

    col1    col2    count1
0   B   2015-03-06 01:37:57     1
1   A   2015-03-06 01:39:57     1
2   A   2015-03-06 01:45:28     2
3   B   2015-03-06 02:31:44     1
4   B   2015-03-06 03:55:45     1
5   B   2015-03-06 04:01:40     2

Note: 'col2' holds full datetimes, so the date matters, not only the time of day.

Jesse · Accepted Answer · 2018-03-18 14:59:05Z

3

The problem is that apply is very expensive. One option is to optimize the code with Cython or Numba.

This might be helpful.

Another option is the following:

1. Create a column of timestamps from col2.
2. Create a column of ids that group the timestamps by your 10-minute criterion.
3. Create a combined column from those ids and col1, as in df['time_ids'].map(str) + df['col1'].
4. Use groupby to count the rows that share a combined id. Something like: df.groupby(df['combined_ids']).size()
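
The steps above can be sketched as follows. This is a minimal illustration, assuming the ids in step 2 come from flooring each timestamp to a fixed 10-minute bin with dt.floor; the answer does not say how to build them, and the column name count_fixed is made up for this example. The data is the question's sample with whitespace stripped from col1.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["B", "A", "A", "B", "B", "B"],
    "col2": pd.to_datetime([
        "2015-03-06 01:37:57", "2015-03-06 01:39:57", "2015-03-06 01:45:28",
        "2015-03-06 02:31:44", "2015-03-06 03:55:45", "2015-03-06 04:01:40",
    ]),
})

# Step 2 (assumption): label each timestamp with the start of its fixed 10-minute bin.
df["time_ids"] = df["col2"].dt.floor("10min")
# Step 3: combine the bin label with col1 into a single grouping key.
df["combined_ids"] = df["time_ids"].map(str) + df["col1"]
# Step 4: count rows per key and broadcast the count back onto every row.
df["count_fixed"] = df.groupby("combined_ids")["col1"].transform("size")
```

Fixed bins are only an approximation of a trailing window: two rows less than 10 minutes apart can fall into different bins. On this sample every row gets a count of 1, whereas the question's expected output has 2 for rows 2 and 5.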


Andy Wong · Accepted Answer · 2020-03-18 10:30:34Z

0

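Try grouping with pd.Grouper:

# Make sure col2 holds datetimes before grouping on it.
df['col2'] = pd.to_datetime(df['col2'])
# Count rows per (hourly bucket, col1) pair; freq sets the bucket width.
df.groupby([pd.Grouper(key='col2', freq='H'), df['col1']]).size().reset_index(name='count')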
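
Grouping into fixed buckets gives one count per bucket, not a count per row over the preceding 10 minutes. A trailing-window count can be sketched without apply by sorting each group's timestamps and using a binary search for the window bounds. This is not taken from either answer; it is a swapped-in technique based on numpy.searchsorted, and the helper name count_in_window and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def count_in_window(df, key="col1", time="col2", window="10min"):
    """For each row, count same-key rows with time in [t - window, t]."""
    width = pd.Timedelta(window).to_timedelta64()
    times = df[time].to_numpy(dtype="datetime64[ns]")
    counts = np.empty(len(df), dtype=np.int64)
    # .indices maps each key value to the integer positions of its rows.
    for pos in df.groupby(key).indices.values():
        own = times[pos]
        ordered = np.sort(own)
        # First sorted position at or after the window start (inclusive).
        lo = np.searchsorted(ordered, own - width, side="left")
        # One past the last sorted position at or before the row's own time.
        hi = np.searchsorted(ordered, own, side="right")
        counts[pos] = hi - lo
    return pd.Series(counts, index=df.index)

df = pd.DataFrame({
    "col1": ["B", "A", "A", "B", "B", "B"],
    "col2": pd.to_datetime([
        "2015-03-06 01:37:57", "2015-03-06 01:39:57", "2015-03-06 01:45:28",
        "2015-03-06 02:31:44", "2015-03-06 03:55:45", "2015-03-06 04:01:40",
    ]),
})
df["count1"] = count_in_window(df)
```

On the question's sample this reproduces the count1 column shown there (1, 1, 2, 1, 1, 2). Both window ends are inclusive, matching the >= and <= comparisons in the question's mask.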
