I am trying to merge several DataFrames on a common column. This will be done in a loop, and the original DataFrame may not have all of the columns, so an outer merge is necessary. However, when I do this over several different DataFrames, columns are duplicated with the suffixes _x and _y. I am looking for one DataFrame where the data is filled in and columns are added only if they did not previously exist.
df1=pd.DataFrame({'Company Name':['A','B','C','D'],'Data1':[1,34,23,66],'Data2':[13,54,5354,443]})
Company Name Data1 Data2
0 A 1 13
1 B 34 54
2 C 23 5354
3 D 66 443
A second DataFrame with additional information for some of the companies:
df2=pd.DataFrame({'Company Name':['A','B'],'Address': ['str1', 'str2'], 'Phone': ['str1a', 'str2a']})
Company Name Address Phone
0 A str1 str1a
1 B str2 str2a
If I combine these two, they merge successfully into one using on='Company Name':
df1=pd.merge(df1,df2, on='Company Name', how='outer')
Company Name Data1 Data2 Address Phone
0 A 1 13 str1 str1a
1 B 34 54 str2 str2a
2 C 23 5354 NaN NaN
3 D 66 443 NaN NaN
However, if I run this same command again in a loop, or merge with another DataFrame holding other company information, I end up with duplicate columns like the following:
df1=pd.merge(df1,pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']}), on='Company Name', how='outer')
Company Name Data1 Data2 Address_x Phone_x Address_y Phone_y
0 A 1 13 str1 str1a NaN NaN
1 B 34 54 str2 str2a NaN NaN
2 C 23 5354 NaN NaN str3 str3a
3 D 66 443 NaN NaN NaN NaN
What I really want is one DataFrame with the same columns, with any missing data populated:
Company Name Data1 Data2 Address Phone
0 A 1 13 str1 str1a
1 B 34 54 str2 str2a
2 C 23 5354 str3 str3a
3 D 66 443 NaN NaN
Thanks in advance. I have reviewed the previous questions here on duplicate columns and the pandas documentation, without any progress.
update
pd.merge(df1,df_other,how='outer').groupby('Company Name').first().reset_index(). This is not the most efficient method, but without more context on why you want to do this, it should work well enough. Could you, for example, first concatenate all the other DataFrames in the loop and then merge to df1, or do you need df1 updated at each iteration to run other code?
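To make the two suggestions above concrete, here is a minimal sketch using the frames from the question. The helper name merge_fill and the frame name df3 are made up for illustration, and the concatenate-first variant assumes the extra frames do not contain conflicting rows for the same company.

```python
import pandas as pd

df1 = pd.DataFrame({'Company Name': ['A', 'B', 'C', 'D'],
                    'Data1': [1, 34, 23, 66],
                    'Data2': [13, 54, 5354, 443]})
df2 = pd.DataFrame({'Company Name': ['A', 'B'],
                    'Address': ['str1', 'str2'],
                    'Phone': ['str1a', 'str2a']})
# df3 is a hypothetical name for the third frame in the question
df3 = pd.DataFrame({'Company Name': ['C'],
                    'Address': ['str3'],
                    'Phone': ['str3a']})


def merge_fill(left, right, key='Company Name'):
    """Outer-merge two frames and collapse to one row per key (illustrative helper)."""
    # Without on=, merge joins on every column the frames share,
    # so overlapping columns are never split into _x/_y pairs.
    merged = pd.merge(left, right, how='outer')
    # first() takes the first non-null value per column within each group.
    return merged.groupby(key).first().reset_index()


# Variant 1: update the running result on every iteration.
result = df1
for other in (df2, df3):
    result = merge_fill(result, other)

# Variant 2: stack the extra frames first, then merge once.
others = pd.concat([df2, df3], ignore_index=True)
result_once = pd.merge(df1, others, on='Company Name', how='outer')

print(result)
print(result_once)
```

One side effect to be aware of: in the loop variant, the intermediate outer merge introduces NaN rows, so integer columns such as Data1 can come back as floats. If two frames supply different non-null values for the same company and column, first() silently keeps whichever comes first.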