
I have a huge time-series DataFrame (about 100,000,000 rows) and I need to filter its rows by a set of conditions. Each condition is stored in a row of another DataFrame, which has about 2000 rows, one condition per row. Toy example:

df = pd.DataFrame({'val': [1, 3, 2, 4, 3, 1, 2, 3],
                   'date': ['2015-03-12', '2015-04-12', '2015-05-13', '2016-03-12',
                            '2016-04-07', '2016-05-12', '2017-01-11', '2017-03-20']})
df_condition = pd.DataFrame({'val': [2, 3], 'date': ['2015-07-13', '2016-04-08']})

The condition is: remove all rows in df where val appears earlier than the corresponding date in df_condition:

df = df[~((df['val'] == 2) & (df['date'] < '2015-07-13'))]
df = df[~((df['val'] == 3) & (df['date'] < '2016-04-08'))]

and so on, for about 2000 conditions.

I currently use the loop below, but it takes too long (about 5 hours). Is there a faster method?

vals = df_condition.val.values
dates = df_condition.date.values
for i in range(len(df_condition)):
    df = df[~((df.val == vals[i]) & (df.date < dates[i]))]

1 Answer


I believe you can create a list of boolean masks, one per condition, and then reduce them in a single step. Build each mask to flag the rows that the condition removes, combine them with logical_or, and negate the result:

masks = [(df.val == x.val) & (df.date < x.date) for x in df_condition.itertuples()]
print (masks)
[0    False
1    False
2     True
3    False
4    False
5    False
6    False
7    False
dtype: bool, 0    False
1     True
2    False
3    False
4     True
5    False
6    False
7    False
dtype: bool]

df1 = df[~np.logical_or.reduce(masks)]
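For reference, here is a self-contained sketch of this mask-and-reduce idea on the toy data. Each mask flags the rows a single condition removes, so the masks are combined with a single logical_or and negated to keep everything that no condition removed:

```python
import numpy as np
import pandas as pd

# Toy data from the question, with dates parsed as timestamps.
df = pd.DataFrame({
    'val': [1, 3, 2, 4, 3, 1, 2, 3],
    'date': pd.to_datetime(['2015-03-12', '2015-04-12', '2015-05-13',
                            '2016-03-12', '2016-04-07', '2016-05-12',
                            '2017-01-11', '2017-03-20']),
})
df_condition = pd.DataFrame({
    'val': [2, 3],
    'date': pd.to_datetime(['2015-07-13', '2016-04-08']),
})

# One boolean "remove this row" mask per condition row.
masks = [(df.val == x.val) & (df.date < x.date)
         for x in df_condition.itertuples()]

# OR-reduce the removal masks once, then keep the complement.
df1 = df[~np.logical_or.reduce(masks)]
print(df1)
```

This visits the big frame once per condition to build the masks but filters it only once, instead of reassigning a shrinking copy of df inside the loop 2000 times.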

1 Comment

Thanks for such an elegant and fast way!
