Making matching algorithm between two data frames more efficient

Question

I have two data frames, e.g.:

Shorter time frame ( 4 hourly )

Time                  Data_4h
1/1/01 00:00          1.1
1/1/01 06:00          1.2
1/1/01 12:00          1.3
1/1/01 18:00          1.1
2/1/01 00:00          1.1
2/1/01 06:00          1.2
2/1/01 12:00          1.3
2/1/01 18:00          1.1
3/1/01 00:00          1.1
3/1/01 06:00          1.2
3/1/01 12:00          1.3
3/1/01 18:00          1.1

Longer time frame ( 1 day )

Time                  Data_1d
1/1/01 00:00          1.1
2/1/01 00:00          1.6
3/1/01 00:00          1.0

I want to label each row of the shorter time frame with the value from the longer time frame for the previous day (n-1), leaving NaN where no row exists for day n-1.

For example,

Final merged data combining 4h and 1d

Time                  Data_4h     Data_1d
1/1/01 00:00          1.1         NaN
1/1/01 06:00          1.2         NaN
1/1/01 12:00          1.3         NaN
1/1/01 18:00          1.1         NaN
2/1/01 00:00          1.1         1.1
2/1/01 06:00          1.2         1.1
2/1/01 12:00          1.3         1.1
2/1/01 18:00          1.1         1.1 
3/1/01 00:00          1.1         1.6
3/1/01 06:00          1.2         1.6
3/1/01 12:00          1.3         1.6
3/1/01 18:00          1.1         1.6

So for 1/1, it tried to find 31/12 but couldn't, so those rows were labelled NaN. For 2/1, it searched for 1/1 and labelled those entries with 1.1, the value for 1/1. For 3/1, it searched for 2/1 and labelled those entries with 1.6, the value for 2/1.

It is important to note that the data in either time frame may have large gaps, so I can't access the rows in the longer time frame directly.

What is the best way to do this?

Currently I am iterating through all the rows of the smaller timeframe and then searching for the larger time frame date using a filter like:

large_tf_data[(large_tf_data.index <= target_timestamp)][0]

Where target_timestamp is calculated on each row in the smaller time frame data frame.

This is extremely slow! Any suggestions on how to speed it up?

Are those dates dayfirst or monthfirst?

– cs95, Commented May 16, 2018 at 17:32

piRSquared · Accepted Answer · 2018-05-16 17:40:25Z

First, take care of dates

import pandas as pd

# Parse the Time strings as day-first dates (2/1/01 means 2 January 2001)
dayfirstme = lambda d: pd.to_datetime(d.Time, dayfirst=True)
df = df.assign(Time=dayfirstme)
df2 = df2.assign(Time=dayfirstme)
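
To see why the dayfirst flag matters for this data, here is a minimal sketch that parses a few of the question's timestamps both ways. The `raw`, `day_first` and `month_first` names are made up for illustration only.

```python
import pandas as pd

# A few timestamps in the same string format as the question's data.
raw = pd.Series(["1/1/01 18:00", "2/1/01 00:00", "3/1/01 06:00"])

# dayfirst=True reads "2/1/01" as 2 January 2001.
day_first = pd.to_datetime(raw, dayfirst=True)

# dayfirst=False reads the same string as 1 February 2001.
month_first = pd.to_datetime(raw, dayfirst=False)

print(day_first.dt.strftime("%Y-%m-%d").tolist())
print(month_first.dt.strftime("%Y-%m-%d").tolist())
```

With the wrong setting the rows land a month apart instead of a day apart, so the n-1 day lookup would find nothing.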

Then convert df2 to something useful: a Series of Data_1d indexed by Time shifted forward one day

d2 = df2.assign(Time=lambda d: d.Time + pd.Timedelta(1, 'D')).set_index('Time').Data_1d

Apply magic

df.join(df.Time.dt.date.map(d2).rename(d2.name))

                  Time  Data_4h  Data_1d
0  2001-01-01 00:00:00      1.1      NaN
1  2001-01-01 06:00:00      1.2      NaN
2  2001-01-01 12:00:00      1.3      NaN
3  2001-01-01 18:00:00      1.1      NaN
4  2001-01-02 00:00:00      1.1      1.1
5  2001-01-02 06:00:00      1.2      1.1
6  2001-01-02 12:00:00      1.3      1.1
7  2001-01-02 18:00:00      1.1      1.1
8  2001-01-03 00:00:00      1.1      1.6
9  2001-01-03 06:00:00      1.2      1.6
10 2001-01-03 12:00:00      1.3      1.6
11 2001-01-03 18:00:00      1.1      1.6
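
For anyone who wants to run this end to end, the steps above can be sketched as a self-contained script. The frames are rebuilt from the question's sample data. The `lookup` and `result` names are my own, and I use `dt.normalize()` instead of `dt.date` so the lookup keys stay datetime64 like the index they are matched against. That substitution is an assumption on my part, not part of the answer's code.

```python
import pandas as pd

# Sample frames rebuilt from the question (times are day-first strings).
df = pd.DataFrame({
    "Time": ["1/1/01 00:00", "1/1/01 06:00", "1/1/01 12:00", "1/1/01 18:00",
             "2/1/01 00:00", "2/1/01 06:00", "2/1/01 12:00", "2/1/01 18:00",
             "3/1/01 00:00", "3/1/01 06:00", "3/1/01 12:00", "3/1/01 18:00"],
    "Data_4h": [1.1, 1.2, 1.3, 1.1] * 3,
})
df2 = pd.DataFrame({
    "Time": ["1/1/01 00:00", "2/1/01 00:00", "3/1/01 00:00"],
    "Data_1d": [1.1, 1.6, 1.0],
})

df["Time"] = pd.to_datetime(df["Time"], dayfirst=True)
df2["Time"] = pd.to_datetime(df2["Time"], dayfirst=True)

# Shift each daily value forward one day so it is keyed by the day it labels.
lookup = (
    df2.assign(Time=df2["Time"] + pd.Timedelta(days=1))
       .set_index("Time")["Data_1d"]
)

# normalize() floors each timestamp to midnight and keeps the datetime64 dtype,
# so each row is looked up by its own calendar day in the shifted index.
result = df.assign(Data_1d=df["Time"].dt.normalize().map(lookup))
print(result)
```

Days with no entry for the previous day simply have no key in `lookup`, so they come out as NaN without any per-row filtering.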

I'm sure there are other ways but I didn't want to think about this anymore.
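
One of those other ways, as a hedged sketch: `pd.merge_asof` can do the lookup as a single sorted merge. Note that this is a different technique with different semantics. It picks the most recent daily row strictly before each row's day, so a gap in the daily data is filled from the last earlier day rather than left as NaN. The small frames and the `Day`, `left`, `right` and `merged` names below are made up for illustration, and both frames are assumed to be sorted by time already.

```python
import pandas as pd

# Small illustrative frames with a gap: there is no daily row for 3/1 or 4/1.
df = pd.DataFrame({
    "Time": pd.to_datetime(["2001-01-01 00:00", "2001-01-01 12:00",
                            "2001-01-02 06:00", "2001-01-05 18:00"]),
    "Data_4h": [1.1, 1.3, 1.2, 1.1],
})
df2 = pd.DataFrame({
    "Time": pd.to_datetime(["2001-01-01", "2001-01-02"]),
    "Data_1d": [1.1, 1.6],
})

# Key each intraday row by its calendar day (midnight timestamp).
left = df.assign(Day=df["Time"].dt.normalize())
right = df2.rename(columns={"Time": "Day"})

# direction="backward" with allow_exact_matches=False picks the last daily
# row whose Day is strictly earlier than the intraday row's Day.
merged = pd.merge_asof(
    left, right, on="Day",
    direction="backward", allow_exact_matches=False,
).drop(columns="Day")
print(merged)
```

Here the 5/1 row picks up the value from 2/1, whereas the exact-match lookup above would give NaN for it. `merge_asof` also has a `tolerance` parameter for limiting how far back it may look.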


1 Comment

cs95 Over a year ago

This is how I was going to do it myself, but I wanted to confirm if the dates were dayfirst. This confirms it though.
