import pandas as pd
d = [{'col1' : ' B', 'col2' : '2015-3-06 01:37:57'},
     {'col1' : ' A', 'col2' : '2015-3-06 01:39:57'},
     {'col1' : ' A', 'col2' : '2015-3-06 01:45:28'},
     {'col1' : ' B', 'col2' : '2015-3-06 02:31:44'},
     {'col1' : ' B', 'col2' : '2015-3-06 03:55:45'},
     {'col1' : ' B', 'col2' : '2015-3-06 04:01:40'}]
df = pd.DataFrame(d)
df['col2'] = pd.to_datetime(df['col2'])
For each row, I want to count the rows that have the same 'col1' value and a 'col2' time within the 10 minutes up to and including that row's time. I'm interested in an implementation that runs fast.
This code gives the right result, but it is very slow on a big dataset:
dt = pd.Timedelta(10, unit='m')
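def count1(row):
    id1 = row['col1']
    # Window is [row time - 10 minutes, row time], both ends inclusive
    start_time = row['col2'] - dt
    end_time = row['col2']
    # Scans the whole frame for every row, hence the slowness
    mask = (df['col1'] == id1) & ((df['col2'] >= start_time) & (df['col2'] <= end_time))
    return df.loc[mask].shape[0]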
df['count1'] = df.apply(count1, axis=1)
df.head(6)
  col1                col2  count1
0    B 2015-03-06 01:37:57       1
1    A 2015-03-06 01:39:57       1
2    A 2015-03-06 01:45:28       2
3    B 2015-03-06 02:31:44       1
4    B 2015-03-06 03:55:45       1
5    B 2015-03-06 04:01:40       2
Note: 'col2' holds full timestamps, so the window must take the date into account, not only the time of day.
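
For comparison, here is a minimal sketch of one possible faster approach. It is not a benchmarked solution. It assumes 'col2' contains no missing values, and the helper name `count_in_window` and the output column `count2` are made up for illustration. The idea is to sort the timestamps once per 'col1' group and use binary search (`np.searchsorted`) to find how many of them fall in the inclusive window, instead of scanning the whole frame for every row:

```python
import numpy as np
import pandas as pd

d = [{'col1': ' B', 'col2': '2015-3-06 01:37:57'},
     {'col1': ' A', 'col2': '2015-3-06 01:39:57'},
     {'col1': ' A', 'col2': '2015-3-06 01:45:28'},
     {'col1': ' B', 'col2': '2015-3-06 02:31:44'},
     {'col1': ' B', 'col2': '2015-3-06 03:55:45'},
     {'col1': ' B', 'col2': '2015-3-06 04:01:40'}]
df = pd.DataFrame(d)
df['col2'] = pd.to_datetime(df['col2'])


def count_in_window(frame, key='col1', time='col2',
                    window=np.timedelta64(10, 'm')):
    """Hypothetical helper: for each row, count the rows with the same
    `key` whose `time` lies in [row time - window, row time].
    Assumes `time` has no missing values (NaT)."""
    times = frame[time].to_numpy()
    counts = np.zeros(len(frame), dtype=np.int64)
    # GroupBy.indices maps each key to the integer positions of its rows
    for positions in frame.groupby(key).indices.values():
        group_times = times[positions]
        sorted_times = np.sort(group_times)
        # Number of timestamps <= row time (upper end inclusive)
        upper = np.searchsorted(sorted_times, group_times, side='right')
        # Number of timestamps < row time - window (lower end inclusive)
        lower = np.searchsorted(sorted_times, group_times - window, side='left')
        counts[positions] = upper - lower
    return pd.Series(counts, index=frame.index)


df['count2'] = count_in_window(df)
```

On the sample data above, `count2` matches `count1`. Because the comparison is done on the full `datetime64` values, the date is taken into account as well as the time of day. Each group costs one sort plus two binary searches per row, so the work grows roughly as n log n rather than n squared.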