3

I have two dataframes, DF1 with shape (33, 92) and DF2 with shape (11, 18). I want to copy the 18 columns of DF2 into DF1 wherever the value in the 'ID' column, which exists in both dataframes, matches. These 18 columns have the same names in both dataframes.

I used the following merge: finaldf = pd.merge(DF1, DF2, on = 'ID', how ='left')

This works, except that it renamed the 18 columns in DF1 and added another set of columns from DF2, so the final dataframe has shape (33, 109). It is supposed to keep DF1's shape (33, 92), just with updated rows.

3
  • If "ID" is the index column in DF1, then we can't meet this requirement, as we can't have two rows with the same index ID. Commented May 24, 2021 at 4:48
  • Look at the documentation for pandas merge. The merge method adds suffixes when both dataframes have columns with the same name, to distinguish between them. You can rename or remove the extra ones afterwards. Commented May 24, 2021 at 5:26
  • Does this answer your question? pandas left join and update existing column Commented May 24, 2021 at 5:34

2 Answers 2

4

Your finaldf has shape (33, 109) after the merge because the columns that share a name in both dataframes appear twice, with _x and _y appended. The _x columns come from DF1 and the _y columns come from DF2.

Run the code below after the merge to copy the values from DF2 into DF1 where they matched on "ID", and to remove the extra "_x" and "_y" columns:

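remove_cols = []

for col in DF2.columns:
    if col == 'ID':
        continue  # the join key is not suffixed
    # Prefer DF2's value (_y); fall back to DF1's value (_x) where _y is NaN
    # (no match on ID, or NaN in DF2)
    finaldf[col] = finaldf[col+'_y'].fillna(finaldf[col+'_x'])
    remove_cols += [col+'_x', col+'_y']

# Drop the suffixed helper columns
finaldf.drop(remove_cols, axis=1, inplace=True)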

For more information on why the "_x" and "_y" columns appear in your merged dataframe, check the official documentation of the pd.DataFrame.merge method. "_x" and "_y" are the suffixes that merge adds by default to distinguish between columns that have the same name in both dataframes.
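To make the merge-then-coalesce steps above concrete, here is a minimal sketch on toy data. The two small frames, the column names a and extra, and the values are made up for illustration; only the 'ID' key and the DF1/DF2/finaldf names come from the question.

```python
import pandas as pd

# Toy stand-ins for DF1 and DF2 (hypothetical data, not the asker's frames)
DF1 = pd.DataFrame({"ID": [1, 2, 3], "a": [10, 20, 30], "extra": ["x", "y", "z"]})
DF2 = pd.DataFrame({"ID": [2, 3], "a": [200, 300]})

finaldf = pd.merge(DF1, DF2, on="ID", how="left")
suffixed_columns = list(finaldf.columns)  # the shared column 'a' is now 'a_x' and 'a_y'

for col in DF2.columns.drop("ID"):
    # Take DF2's value where the ID matched, otherwise keep DF1's value
    finaldf[col] = finaldf[col + "_y"].fillna(finaldf[col + "_x"])
    finaldf = finaldf.drop(columns=[col + "_x", col + "_y"])

print(suffixed_columns)
print(finaldf)
```

The result has the same shape as DF1, but the rebuilt columns end up at the right edge. If the original column order matters, selecting finaldf[DF1.columns] afterwards should restore it.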


Alternatively:

pd.DataFrame.update is a method in pandas to achieve what you are trying to do.

See its documentation for details. There is one caveat: if DF2 contains NaN values that you would like to copy into DF1, update will not copy them. It only updates with non-NA values, as the documentation states:

Modify in place using non-NA values from another DataFrame.
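As a rough illustration of the update route, the sketch below sets 'ID' as the index on both frames so that update can align rows by ID. The toy frames and the column names a and b are assumptions made for the example, and it assumes the IDs in DF2 are unique.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical data); DF2 has a NaN for ID 3 on purpose
DF1 = pd.DataFrame({"ID": [1, 2, 3], "a": [10.0, 20.0, 30.0], "b": ["p", "q", "r"]})
DF2 = pd.DataFrame({"ID": [2, 3], "a": [200.0, np.nan]})

updated = DF1.set_index("ID")        # work on a copy indexed by ID
updated.update(DF2.set_index("ID"))  # in place; aligns on index and column names
updated = updated.reset_index()

print(updated)
```

ID 2 picks up the value from DF2, while ID 3 keeps DF1's value because the corresponding DF2 entry is NaN, which is the caveat described above. Column b is untouched because DF2 has no column with that name.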


11 Comments

Thanks for your detailed reply, I will give it a shot. Would it help if I renamed the 18 columns in DF2 and mapped them to the corresponding 18 columns in DF1 to avoid duplicate names? If so, would the line below work? finaldf = pd.merge(DF1[cols_18], DF2, on = 'ID', how ='left')
The 18 column names in DF1 and DF2 are already the same, right? That's why '_x' and '_y' are being added by merge.
Thanks, right. What if I change the DF2 column names: how would I map them back to DF1 when the column names differ? Does it work like that?
Hi, sorry I was inactive. I didn't understand your question "skip a populated data within 18 columns of _x after merging". If you have some new requirement, scenario that you would like to cover, you can always post a new question. That way you will be able to explain your use-case better. And let me know if you post it, will help you out :)
There is no finaldf[col] in your dataframe after the merge; this loop creates it. After the merge you have finaldf[col+'_x'] and finaldf[col+'_y']. pd.merge has already copied the values from DF2 for rows that match on ID, and they are stored in finaldf[col+'_y']. Where the IDs did not match, you wanted to keep DF1's values, right? That is what the fillna line in the loop does. And since you do not want the '_x' and '_y' columns in finaldf, I create finaldf[col] to store the final data. It did not exist in your merged finaldf, so you cannot put a condition on it.
0

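If you want the values for those 18 columns (say col1, col2, ..., col18) to come from DF2 only, you can do:

cols_18 = ["col1", "col2", ...]  # the column names shared with DF2
# Keep every DF1 column except the shared ones, preserving DF1's column order
cols_to_use = [c for c in DF1.columns if c not in cols_18]
pd.merge(DF1[cols_to_use], DF2, on='ID', how='left')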
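A small sketch of this drop-then-merge approach on toy data may help show its side effect. The frames and the column names a and extra are hypothetical; the shared columns are detected from DF2 here rather than listed by hand.

```python
import pandas as pd

# Toy stand-ins (hypothetical data)
DF1 = pd.DataFrame({"ID": [1, 2, 3], "a": [10, 20, 30], "extra": ["x", "y", "z"]})
DF2 = pd.DataFrame({"ID": [2, 3], "a": [200, 300]})

# Columns present in both frames, excluding the join key
shared = [c for c in DF2.columns if c != "ID" and c in DF1.columns]
cols_to_use = [c for c in DF1.columns if c not in shared]

result = pd.merge(DF1[cols_to_use], DF2, on="ID", how="left")
print(result)
```

No suffixes appear because the column names no longer collide, but the row with ID 1 has NaN in column a: DF1's original value was dropped before the merge and DF2 has no row for that ID.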

If you want to keep the columns from both dataframes, the default suffixes are _x and _y, but you can override them as follows:

pd.merge(DF1, DF2, on='ID', how='left', suffixes=("", "_new"))

There will still be 109 columns, but the main dataframe's column names stay intact. The columns that come from DF2 get the suffix "_new".
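For illustration, here is how the custom suffixes look on toy data, followed by one possible way to fold the "_new" values back into the original columns. The frames and the column names a and extra are made up, and the combine_first step is an addition for the sake of the example rather than part of the suggestion above.

```python
import pandas as pd

# Toy stand-ins (hypothetical data)
DF1 = pd.DataFrame({"ID": [1, 2, 3], "a": [10, 20, 30], "extra": ["x", "y", "z"]})
DF2 = pd.DataFrame({"ID": [2, 3], "a": [200, 300]})

merged = pd.merge(DF1, DF2, on="ID", how="left", suffixes=("", "_new"))
merged_columns = list(merged.columns)  # DF1's names are unchanged; DF2's copy is 'a_new'

# Optional follow-up: prefer the DF2 value, fall back to DF1's, then drop the helper
merged["a"] = merged["a_new"].combine_first(merged["a"])
merged = merged.drop(columns=["a_new"])

print(merged_columns)
print(merged)
```

Because DF1's column names and positions are preserved, the folded result keeps DF1's shape and column order.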

3 Comments

But this method only takes values from DF2 for those 18 columns, so rows where the 'ID' column does not match will have NaNs in these 18 columns.
If we removed the 18 columns from DF1 and kept only [cols_to_use], we would lose the information in the rows that do not match DF2. I still need those rows' values in DF1 after merging on the common value.
I am not able to understand the question exactly. Could you provide an example with DataFrames of size, say, 2x2 and show what you want?
