Problem: I have two DataFrames, df1 and df2. My goal is to modify df1 by replacing some of its values when a matching entry is found in df2.

import pandas as pd

# dataframe 1
data = {'A':[90,20,30,25,50,60],
        'B':['qq','ee','rr','tt','ii','oo'],
        'C':['XX','VV','BB','NN','KK','JJ']}
df1 = pd.DataFrame(data)

# dataframe 2
convert_table = {'X': ['dd','ee','ff','gg','hh','ii','ll','mm','nn','oo','pp','qq','rr','ss','tt','uu'], 
                 'Y': ['DD','VV','FF','GG','HH','KK','LL','MM','NN','JJ','PP','XX','BB','SS','NN','LL'], 
                 'Z': [5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61]}
df2 = pd.DataFrame(convert_table)

# search values of df1 inside of df2 and replace values
for idx1,row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        if row1['B']==row2['X'] and row1['C']==row2['Y']:
            df1.replace(to_replace=row1['B'],value=row2['Z'],inplace=True) 

As you can see, I use two nested for loops to check whether each row of df1 (row1) is found in df2. If it is, I replace the value in row1['B'] with the one in row2['Z'].

The result I get is exactly what I want:

In [120]: df1
Out[120]: 
    A   B   C
0  90  43  XX
1  20   7  VV
2  30  47  BB
3  25  59  NN
4  50  19  KK
5  60  37  JJ

Notice how column B has changed.

Question: could you suggest a better way to write my code? I would like to make it as fast as possible, perhaps by using the built-in functions offered by pandas or Python.

Note: the data in the DataFrames is just for demonstration purposes.

1 Answer

Use merge on two columns:

df1.merge(df2, left_on=['B','C'], right_on=['X','Y'], how='left')

The how='left' is critical here: it keeps every row of df1, even those with no match in df2. Read "Brief primer on merge methods (relational algebra)" if you don't understand why.
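As a rough illustration of why how='left' matters, the sketch below merges two tiny made-up frames (the names left, right, inner and left_join are only illustrative, not from the question). With how='inner' the unmatched row disappears; with how='left' it is kept and Z is filled with NaN.

```python
import pandas as pd

# Illustrative data: only ('qq', 'XX') has a partner in `right`
left = pd.DataFrame({'B': ['qq', 'ii'], 'C': ['XX', 'KK']})
right = pd.DataFrame({'X': ['qq'], 'Y': ['XX'], 'Z': [43]})

# Inner join: rows of `left` without a match are dropped
inner = left.merge(right, left_on=['B', 'C'], right_on=['X', 'Y'], how='inner')

# Left join: every row of `left` survives; unmatched rows get NaN in X, Y, Z
left_join = left.merge(right, left_on=['B', 'C'], right_on=['X', 'Y'], how='left')

print(len(inner))      # 1
print(len(left_join))  # 2
print(left_join['Z'].isna().tolist())  # [False, True]
```

Because the goal is to modify df1 without losing rows, the left join is the one that preserves its shape.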

I'll modify your example so that one entry in df1, ('ii','KK'), doesn't exist in df2:

In [1]:
# dataframe 2
convert_table = {'X': ['dd','ee','ff','gg','hh','ll','mm','nn','oo','pp','qq','rr','ss','tt','uu'], 
                 'Y': ['DD','VV','FF','GG','HH','LL','MM','NN','JJ','PP','XX','BB','SS','NN','LL'], 
                 'Z': [5,7,11,13,17,19,23,29,37,41,43,47,53,59,61]}
df2 = pd.DataFrame(convert_table)



In [2]: merged = df1.merge(df2, left_on=['B','C'], right_on=['X','Y'], how='left')
        merged
Out[2]: 
    A   B   C    X    Y     Z
0  90  qq  XX   qq   XX  43.0
1  20  ee  VV   ee   VV   7.0
2  30  rr  BB   rr   BB  47.0
3  25  tt  NN   tt   NN  59.0
4  50  ii  KK  NaN  NaN   NaN
5  60  oo  JJ   oo   JJ  37.0

Now to retrieve the final dataframe:

In [3]:
# .ix was removed in pandas 1.0; .loc does the same label/boolean selection
merged.loc[merged.Z.notnull(),'B'] = merged.loc[merged.Z.notnull(),'Z']
merged = merged[['A','B','C']]
merged

Out[3]:
    A   B   C
0  90  43  XX
1  20   7  VV
2  30  47  BB
3  25  59  NN
4  50  ii  KK
5  60  37  JJ
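For reference, the steps above can be condensed into one self-contained sketch that leaves df1 untouched and returns a frame with the same three columns. It rests on a few assumptions: df2 here is a trimmed stand-in containing only the relevant rows, each (X, Y) pair occurs at most once in df2 (enforced with validate='many_to_one'), df1 has a default 0..n-1 index, and the names result and matched are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [90, 20, 30, 25, 50, 60],
                    'B': ['qq', 'ee', 'rr', 'tt', 'ii', 'oo'],
                    'C': ['XX', 'VV', 'BB', 'NN', 'KK', 'JJ']})

# Trimmed conversion table (assumption): ('ii', 'KK') is deliberately missing
df2 = pd.DataFrame({'X': ['qq', 'ee', 'rr', 'tt', 'oo'],
                    'Y': ['XX', 'VV', 'BB', 'NN', 'JJ'],
                    'Z': [43, 7, 47, 59, 37]})

# Left join keeps one output row per df1 row; validate raises if a
# (X, Y) pair is duplicated in df2, which would otherwise add rows
merged = df1.merge(df2, left_on=['B', 'C'], right_on=['X', 'Y'],
                   how='left', validate='many_to_one')

matched = merged['Z'].notna().to_numpy()

result = df1.copy()
# B will hold a mix of strings and integers, so make it an object column
result['B'] = result['B'].astype(object)
# Z became float because of the NaN; cast matched values back to int
result.loc[matched, 'B'] = merged.loc[matched, 'Z'].astype(int).to_numpy()

print(result)
```

Unmatched rows such as ('ii', 'KK') keep their original B value, and the column ends up with mixed types, which is worth keeping in mind for later numeric operations.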

2 Comments

Is it possible to get an output that has the same number of columns as the one I got in my example?
I just did this at the same time you posted your comment :)
