How to merge two different size DataFrames in Pandas to update one dataframe depends on matching partial values in one column with another dataframe

Question

I am trying to code a task for work, so I made a simple case that simulates this task. I have two dataframes: data_1 and data_2, and I would like to replace some rows in data_1 with rows from data_2 by condition of matching values in a column named time.

Here is an example:

import numpy as np
import pandas as pd
a = {
    'time':[1,2,3,4,5,6],
    'column_1':[2,2,2,2,2,2],
    'column_2':[3,3,3,3,3,3]   
}
b = {
    'time':[3,4,5],
    'column_1':[0,0,0],
    'column_2':[0,0,0]    
}
data_1 = pd.DataFrame(a)
data_2 = pd.DataFrame(b)

In the result, I would like to get dataframe like this:

time   column_1   column_2
0   1   2   3
1   2   2   3
2   3   0   0
3   4   0   0
4   5   0   0
5   6   2   3

I tried merge and replace methods in Pandas, but it was not successful. I did mask boolean array:

time_1 = list(data_1['time'])
time_2 = list(data_2['time'])
mask_array = np.zeros(len(time_1),dtype = bool)
for i, item in enumerate(time_1):
    if item in time_2:
        mask_array[i] = True

and I received:

array([False, False,  True,  True,  True, False])

But I could not replace data_1 values with data_2 values. What did I do wrong? It does not seem like a difficult task, but I could not find anything useful and just do not know what to do. I do not have a lot of experience with pandas, so maybe I do not understand something.

data_2.set_index("time").combine_first(data_1.set_index("time")).reset_index()? — Henry Yik
– Henry Yik, Commented Oct 4, 2021 at 16:28

SeaBean · Accepted Answer · 2021-10-05 10:09:17Z

1

You can use .update() after setting index on time on both data_1a and data_1b, as follows:

data_1a = data_1.set_index('time')
data_1a.update(data_2.set_index('time'))
data_out = data_1a.reset_index()

.update() modifies in place using non-NA values from another DataFrame. Aligns on indices. Thus, when you set time as index on both data_1a and data_1b, .update() aligns on matching values in column time to perform the update of data_1 by corresponding values of data_2.

Data Setup:

a = {
    'time':[1,2,3,4,5,6],
    'column_1':[2,2,2,2,2,2],
    'column_2':[3,3,3,3,3,3]   
}
b = {
    'time':[3,4,5],
    'column_1':[0,0,0],
    'column_2':[0,0,0]    
}
data_1 = pd.DataFrame(a)
data_2 = pd.DataFrame(b)

Result:

print(data_out)

   time  column_1  column_2
0     1       2.0       3.0
1     2       2.0       3.0
2     3       0.0       0.0
3     4       0.0       0.0
4     5       0.0       0.0
5     6       2.0       3.0

edited Oct 5, 2021 at 10:09

answered Oct 4, 2021 at 16:40

SeaBean

23.4k3 gold badges16 silver badges28 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

gleb_shm Over a year ago

Hello! Your decision helped me, thank you! I didn't know about update() method, interesting!

SeaBean Over a year ago

@gleb_shm It's a relatively less commonly known function but handy when the usage condition met. Make good use of it. Sometimes people may use the combine_first() method as mentioned in comment to achieve it. But in that case you are reversing the subject and object by updating data_2 with data_1. Though could give same result but counterintuitive to reverse the subject and object and prone to error for some edge cases.

Collectives™ on Stack Overflow

How to merge two different size DataFrames in Pandas to update one dataframe depends on matching partial values in one column with another dataframe

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related