
I have a DataFrame, df1, in which every column is correct except 'Employee'. Another DataFrame, df2, holds the correct employee names, stored in its 'Staff' column. I am trying to update df1 by matching 'Key_df1' and 'Key_df2' from the respective DataFrames. I need some help on how to approach this. (The expected output is shown in the image below.)

import pandas as pd

data1=[['NYC-URBAN','JON','$5000','yes','BANKING','AC32456'],['WDC-RURAL','XING','$4500','Yes','FINANCE','AD45678'],['LONDON-URBAN','EDWARDS','$3500','No','IT','DE43216'],
     ['SINGAPORE-URBAN','WOLF','$5000','No','SPORTS','RT45327'],['MUMBAI-RURAL','NEMBIAR','$2500','No','IT','Rs454457']]

data2=[['NYC','MIKE','BANKING','BIKING','AH56245'],['WDC','ALPHA','FINANCE','TREKKING','AD45678'],
     ['LONDON-URBAN','BETA','FINANCE','SLEEPING','DE43216'],['SINGAPORE','WOLF','SPORTS','DANCING','RT45307'],
     ['MUMBAI','NEMBIAR','IT','ZUDO','RS454453']]

List1=['City','Employee', 'Income','Travelling','Industry', 'Key_df1']
List2=['City','Staff','Industry','Hobby', 'Key_df1']

df1=pd.DataFrame(data1,columns=List1)
df2=pd.DataFrame(data2,columns=List2)

Expected Output:

[image: expected output]

Edit (Additional Query):

Thanks for the response. Along with the above question, I want to concatenate the value of the 'Employee' column with the 'Travelling' column from df1, but only for the rows where Key_df1 and Key_df2 match across the two DataFrames. Please see the second expected output below.

[image: second expected output]

3 Answers

4

First set the index in df1 to Key_df1 and save it as a temporary DataFrame:

wrk = df1.set_index('Key_df1')

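Then update its Employee column in place using the Staff column of df2, with df2's index set to Key_df2 (the key column named in the question's text; List2 as posted labels it Key_df1, so adjust the name to match your data). Calling update on the DataFrame, with the Series renamed to Employee, avoids a chained call on wrk.Employee, which may not write back to wrk in pandas versions that use copy-on-write:

wrk.update(df2.set_index('Key_df2').Staff.rename('Employee'))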

And the last operation is to change the index to a "regular" column and move it to the previous location:

result = wrk.reset_index().reindex(columns=List1)

The result is:

              City Employee Income Travelling Industry   Key_df1
0        NYC-URBAN      JON  $5000        yes  BANKING   AC32456
1        WDC-RURAL    ALPHA  $4500        Yes  FINANCE   AD45678
2     LONDON-URBAN     BETA  $3500         No       IT   DE43216
3  SINGAPORE-URBAN     WOLF  $5000         No   SPORTS   RT45327
4     MUMBAI-RURAL  NEMBIAR  $2500         No       IT  Rs454457
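
As an alternative to the index round trip above, the same replacement can be sketched with Series.map. This is a different technique from the update call, shown under the assumptions that df2's key column is named Key_df2 and that its keys are unique; the frames are trimmed copies of the question's data.

```python
import pandas as pd

# Trimmed copies of the question's data (only the columns that matter here).
df1 = pd.DataFrame({
    'Employee': ['JON', 'XING', 'EDWARDS', 'WOLF', 'NEMBIAR'],
    'Key_df1': ['AC32456', 'AD45678', 'DE43216', 'RT45327', 'Rs454457'],
})
df2 = pd.DataFrame({
    'Staff': ['MIKE', 'ALPHA', 'BETA', 'WOLF', 'NEMBIAR'],
    'Key_df2': ['AH56245', 'AD45678', 'DE43216', 'RT45307', 'RS454453'],
})

# Lookup table: key -> correct name (assumes Key_df2 values are unique).
lookup = df2.set_index('Key_df2')['Staff']

# Map each df1 key to a Staff name; keys with no match give NaN,
# and fillna keeps the original Employee value for those rows.
df1['Employee'] = df1['Key_df1'].map(lookup).fillna(df1['Employee'])

print(df1)
```

Only the rows whose keys appear in both frames change; here those are the rows with keys AD45678 and DE43216.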

Edit following the comment about Travelling column

Now update alone is not enough, so the task must be solved another way.

Start by joining df1 with df2.Staff (using set_index so that the join matches on the key):

result = df1.join(df2.set_index('Key_df2').Staff, on='Key_df1')

The second step (the actual update) is:

result['Employee'] = result.Employee.where(result.Staff.isna(),
    result.Staff + '_' + result.Travelling)

And the last step is to drop the Staff column (no longer needed):

result.drop(columns=['Staff'], inplace=True)

The final result is:

              City   Employee Income Travelling Industry   Key_df1
0        NYC-URBAN        JON  $5000        yes  BANKING   AC32456
1        WDC-RURAL  ALPHA_Yes  $4500        Yes  FINANCE   AD45678
2     LONDON-URBAN    BETA_No  $3500         No       IT   DE43216
3  SINGAPORE-URBAN       WOLF  $5000         No   SPORTS   RT45327
4     MUMBAI-RURAL    NEMBIAR  $2500         No       IT  Rs454457
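
The three steps above can be collected into one function and checked against the question's data. This is only a sketch: the helper name update_employee_with_travelling is hypothetical, the frames are trimmed copies of the question's data, and df2's key column is assumed to be named Key_df2.

```python
import pandas as pd

def update_employee_with_travelling(df1, df2):
    """Hypothetical helper: where keys match, set Employee to Staff + '_' + Travelling."""
    # Bring Staff into df1's rows by key; unmatched rows get NaN in Staff.
    staff = df2.set_index('Key_df2')['Staff']
    result = df1.join(staff, on='Key_df1')
    matched = result['Staff'].notna()
    # Keep Employee where there was no match, otherwise use the combined value.
    combined = result['Staff'] + '_' + result['Travelling']
    result['Employee'] = result['Employee'].where(~matched, combined)
    return result.drop(columns=['Staff'])

df1 = pd.DataFrame({
    'Employee': ['JON', 'XING', 'EDWARDS', 'WOLF', 'NEMBIAR'],
    'Travelling': ['yes', 'Yes', 'No', 'No', 'No'],
    'Key_df1': ['AC32456', 'AD45678', 'DE43216', 'RT45327', 'Rs454457'],
})
df2 = pd.DataFrame({
    'Staff': ['MIKE', 'ALPHA', 'BETA', 'WOLF', 'NEMBIAR'],
    'Key_df2': ['AH56245', 'AD45678', 'DE43216', 'RT45307', 'RS454453'],
})

result = update_employee_with_travelling(df1, df2)
print(result)
```

Assigning the result of where back to the column, rather than passing inplace=True on result.Employee, keeps the update explicit and leaves the input df1 unmodified.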

2 Comments

Hi @Validi_Bo, thanks for the response. I am also trying to concatenate the updated 'Employee' column with 'Travelling' column. Could you pls help with this?
I have added expected output in the question as well.
2

You can use Boolean Indexing, e.g.:

mask = df1.Key_df1 == df2.Key_df1.reindex(df1.index)
df1.loc[mask, 'Employee'] = df2.Staff

Output:

              City Employee Income Travelling Industry   Key_df1
0        NYC-URBAN      JON  $5000        yes  BANKING   AC32456
1        WDC-RURAL    ALPHA  $4500        Yes  FINANCE   AD45678
2     LONDON-URBAN     BETA  $3500         No       IT   DE43216
3  SINGAPORE-URBAN     WOLF  $5000         No   SPORTS   RT45327
4     MUMBAI-RURAL  NEMBIAR  $2500         No       IT  Rs454457
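
Note that the mask compares the two key columns row by row, by index label, rather than looking keys up. A minimal sketch of what the reindex step does when df2 has fewer rows than df1, using trimmed copies of the question's data (the three-row df2 is an assumption made for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Employee': ['JON', 'XING', 'EDWARDS', 'WOLF', 'NEMBIAR'],
    'Key_df1': ['AC32456', 'AD45678', 'DE43216', 'RT45327', 'Rs454457'],
})
# Deliberately shorter than df1.
df2 = pd.DataFrame({
    'Staff': ['MIKE', 'ALPHA', 'BETA'],
    'Key_df1': ['AH56245', 'AD45678', 'DE43216'],
})

# Pad df2's keys to df1's index so both Series carry identical labels;
# the rows missing from df2 become NaN and compare as not equal.
padded_keys = df2['Key_df1'].reindex(df1.index)
mask = df1['Key_df1'] == padded_keys

# The assignment aligns df2.Staff on the index labels of the masked rows.
df1.loc[mask, 'Employee'] = df2['Staff']

print(df1)
```

Because the comparison is positional, a matching key that sits in a different row of df2 would not be picked up; a key-based join or lookup is needed for that case.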

6 Comments

Hi, your original answer with df1.Employee[mask] = df2.Staff seemed to work. Is there any reason you changed it to df1.loc[mask, 'Employee'] = df2.Staff?
OK, thanks for the reply. The first one was shorter; I liked it.
FWIW, you can do it in one line if you wish: df1.loc[df1.Key_df1 == df2.Key_df1, 'Employee'] = df2.Staff.
@FelipeLanza If df2 has a different number of rows, df1.Employee[mask] = df2.Staff throws an error. Is there a more generic approach? I made up this simple example; in reality, df2 has a different number of rows than df1.
Just re-index the smaller one with the index from the other. I've edited it.
1

You can also use numpy where:

import numpy as np

df1['Employee'] = np.where(df1['Key_df1'] == df2['Key_df1'], df2['Staff'], df1['Employee'])
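
The comparison above is element-wise, so it relies on both frames having the same length and row order. Where that does not hold, one possible sketch is to do a left merge on the key first and then apply np.where to the merged columns. The reordered, shorter df2 below is an assumption for illustration, and validate='many_to_one' is used to check that df2's keys are unique so the merged frame keeps df1's row count.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Employee': ['JON', 'XING', 'EDWARDS', 'WOLF', 'NEMBIAR'],
    'Key_df1': ['AC32456', 'AD45678', 'DE43216', 'RT45327', 'Rs454457'],
})
# Different length and row order than df1.
df2 = pd.DataFrame({
    'Staff': ['BETA', 'MIKE', 'ALPHA'],
    'Key_df1': ['DE43216', 'AH56245', 'AD45678'],
})

# Left merge keeps df1's rows in order; unmatched keys get NaN in Staff.
merged = df1.merge(df2[['Key_df1', 'Staff']], on='Key_df1',
                   how='left', validate='many_to_one')

# Take Staff where a match was found, otherwise keep the original name.
df1['Employee'] = np.where(merged['Staff'].notna(),
                           merged['Staff'], merged['Employee'])

print(df1)
```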

2 Comments

True, just bear in mind that isin is not the same as an equality check.
@FelipeLanza You are right. It can work in this case, but it would be risky to use it for a big dataframe. I've edited my answer.
