4

i have a dataframe df1:

id   age   val
10    2     10
20    3     NaN
30    4     55
40    5     NaN   

I have another dataframe df2:

age  val_main
 1    111
 2    222
 3    333
 4    444
 5    555
 6    666

I want to only replace value from df1 where it is NaN with its corresponding value from df2.

Final output im looking for is:

id   age   val
10    2     10
20    3     333
30    4     55
40    5     555   

i tried iterating the df1 by for loop and then locating the values from df2 where the row in df1 is null.

eg:

for index,row in df1.iterrows():
    if row['val'].isnull():
        df2.loc[df2.age==row.age].val
        .....
        .....

But i'm looking for a more robust and intelligent way of doing this.

4 Answers 4

3

Use Series.map by dictionary with Series.fillna:

d = dict(zip(df2['age'], df2['val_main']))
#alternative with Series
#d = df2.set_index('age')['val_main']
df1['val'] = df1['val'].fillna(df1['age'].map(d))
print (df1)
   id  age    val
0  10    2   10.0
1  20    3  333.0
2  30    4   55.0
3  40    5  555.0

Performance depends of number of rows in both dataFrames and also by number of unique values and number of missing values:

np.random.seed(123)
N = 1000000

df2 = pd.DataFrame({'age': np.arange(1,101),
                    'val_main':np.random.randint(1000, size=100)})
print (df2)

df1 = pd.DataFrame({'id': np.random.randint(1000, size=N),
                   'age':np.random.randint(100, size=N),
                   'val':np.random.choice([10,20,30,np.nan], size=N)})
print (df1)

In [229]: %%timeit
     ...: df = df1.merge(df2, on = 'age', how='left')
     ...: df['val'] = df.val.fillna(df.pop('val_main'))
     ...: 
172 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [230]: %%timeit d = dict(zip(df2['age'], df2['val_main']))
     ...: #alternative with Series
     ...: #d = df2.set_index('age')['val_main']
     ...: df1['val'] = df1['val'].fillna(df1['age'].map(d))
     ...: 
40 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Sign up to request clarification or add additional context in comments.

1 Comment

@anky_91 - I try find some timings
2

You could left .merge and .fillna the values in val with the merged column val_main:

df = df1.merge(df2, on = 'age', how='left')
df['val'] = df.val.fillna(df.pop('val_main'))

Output

   id  age    val
0  10    2   10.0
1  20    3  333.0
2  30    4   55.0
3  40    5  555.0

Comments

1

I think you are looking for this,

df1=pd.merge(df1,df2,on=['age'],how='left')
df1['val']=df1['val'].fillna(df1['val_main'])

df1.drop('val_main',1,inplace=True) #to remove unnecessary column

print (df1)

O/P:

   id  age    val 
0  10    2   10.0 
1  20    3  333.0 
2  30    4   55.0 
3  40    5  555.0 

Comments

0

i like pd.apply too

df1

   id  age   val
0  10    2  10.0
1  20    3   NaN
2  30    4  55.0
3  40    5   NaN

df2

   age  val_main
0    1       111
1    2       222
2    3       333
3    4       444
4    5       555
5    6       666

df1['val'] = df1[['age','val']].apply(lambda x: df2[df2['age'] == x.age].reset_index()['val_main'][0] if str(x.val) == 'nan' else x.val, axis=1 )

Output

   id  age    val
0  10    2   10.0
1  20    3  333.0
2  30    4   55.0
3  40    5  555.0

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.