1

I am trying to code a task for work, so I made a simple case that simulates this task. I have two dataframes: data_1 and data_2, and I would like to replace some rows in data_1 with rows from data_2 by condition of matching values in a column named time.

Here is an example:

import numpy as np
import pandas as pd
a = {
    'time':[1,2,3,4,5,6],
    'column_1':[2,2,2,2,2,2],
    'column_2':[3,3,3,3,3,3]   
}
b = {
    'time':[3,4,5],
    'column_1':[0,0,0],
    'column_2':[0,0,0]    
}
data_1 = pd.DataFrame(a)
data_2 = pd.DataFrame(b)

In the result, I would like to get dataframe like this:

time   column_1   column_2
0   1   2   3
1   2   2   3
2   3   0   0
3   4   0   0
4   5   0   0
5   6   2   3

I tried merge and replace methods in Pandas, but it was not successful. I did mask boolean array:

time_1 = list(data_1['time'])
time_2 = list(data_2['time'])
mask_array = np.zeros(len(time_1),dtype = bool)
for i, item in enumerate(time_1):
    if item in time_2:
        mask_array[i] = True

and I received:

array([False, False,  True,  True,  True, False])

But I could not replace data_1 values with data_2 values. What did I do wrong? It does not seem like a difficult task, but I could not find anything useful and just do not know what to do. I do not have a lot of experience with pandas, so maybe I do not understand something.

1
  • data_2.set_index("time").combine_first(data_1.set_index("time")).reset_index()? Commented Oct 4, 2021 at 16:28

1 Answer 1

1

You can use .update() after setting index on time on both data_1a and data_1b, as follows:

data_1a = data_1.set_index('time')
data_1a.update(data_2.set_index('time'))
data_out = data_1a.reset_index()

.update() modifies in place using non-NA values from another DataFrame. Aligns on indices. Thus, when you set time as index on both data_1a and data_1b, .update() aligns on matching values in column time to perform the update of data_1 by corresponding values of data_2.

Data Setup:

a = {
    'time':[1,2,3,4,5,6],
    'column_1':[2,2,2,2,2,2],
    'column_2':[3,3,3,3,3,3]   
}
b = {
    'time':[3,4,5],
    'column_1':[0,0,0],
    'column_2':[0,0,0]    
}
data_1 = pd.DataFrame(a)
data_2 = pd.DataFrame(b)

Result:

print(data_out)

   time  column_1  column_2
0     1       2.0       3.0
1     2       2.0       3.0
2     3       0.0       0.0
3     4       0.0       0.0
4     5       0.0       0.0
5     6       2.0       3.0
Sign up to request clarification or add additional context in comments.

2 Comments

Hello! Your decision helped me, thank you! I didn't know about update() method, interesting!
@gleb_shm It's a relatively less commonly known function but handy when the usage condition met. Make good use of it. Sometimes people may use the combine_first() method as mentioned in comment to achieve it. But in that case you are reversing the subject and object by updating data_2 with data_1. Though could give same result but counterintuitive to reverse the subject and object and prone to error for some edge cases.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.