I am trying to merge several DataFrames on a common column. This happens in a loop, and the original DataFrame may not have all of the columns, so an outer merge is necessary. However, when I do this over several different DataFrames, columns are duplicated with the suffixes _x and _y. I am looking for a single DataFrame where the data is filled in and columns are added only if they did not previously exist.

df1=pd.DataFrame({'Company Name':['A','B','C','D'],'Data1':[1,34,23,66],'Data2':[13,54,5354,443]})
      Company Name  Data1  Data2
0            A      1     13
1            B     34     54
2            C     23   5354
3            D     66    443

A second DataFrame with additional information for some of the companies:

df2=pd.DataFrame({'Company Name':['A','B'],'Address':  ['str1', 'str2'], 'Phone': ['str1a', 'str2a']})

  Company Name Address  Phone
0            A    str1  str1a
1            B    str2  str2a

If I combine these two, they merge successfully into one using on='Company Name':

df1=pd.merge(df1,df2, on='Company Name', how='outer')

  Company Name  Data1  Data2 Address  Phone
0            A      1     13    str1  str1a
1            B     34     54    str2  str2a
2            C     23   5354     NaN    NaN
3            D     66    443     NaN    NaN

However, if I run this same command again in a loop, or merge with another DataFrame holding other company information, I end up with duplicate columns like the following:

df1=pd.merge(df1,pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']}), on='Company Name', how='outer')
  Company Name  Data1  Data2 Address_x Phone_x Address_y Phone_y
0            A      1     13      str1   str1a       NaN     NaN
1            B     34     54      str2   str2a       NaN     NaN
2            C     23   5354       NaN     NaN      str3   str3a
3            D     66    443       NaN     NaN       NaN     NaN

What I really want is one DataFrame with the same columns, with only the missing data populated:

  Company Name  Data1  Data2 Address  Phone
0            A      1     13    str1  str1a
1            B     34     54    str2  str2a
2            C     23   5354    str3  str3a
3            D     66    443     NaN    NaN

Thanks in advance. I have reviewed the previous questions asked here on duplicate columns, as well as the Pandas documentation, without any progress.

4
  • 3
    I think you are looking for update Commented Dec 21, 2018 at 20:23
  • Update only aligns on index which most likely is not going to be the same and in addition would not allow the update of columns that are not in the original DataFrame. Commented Dec 21, 2018 at 20:46
  • 2
    @Epic_Test maybe try using pd.merge(df1,df_other,how='outer').groupby('Company Name').first().reset_index(). This is not the most efficient method, but without more context on why you want to do this, it should work well enough. Could you, for example, first concatenate all the other dataframes in the loop and then merge to df1, or do you need df1 updated at each iteration to run other code? Commented Dec 21, 2018 at 22:36
  • @Ben.T This works exactly like I am looking for. I am working around with it to make sure that there are not any unanticipated effects. If you want to put this as an answer I will mark as an accepted answer. I am ok with this not being the most efficient process. The loop I am running would require that I add to the main df as I iterate instead of doing all at once and at the time I do not know how many iterations will occur or what columns will be present. Thanks for your help! Commented Jan 1, 2019 at 3:54

2 Answers

1

Since you want to merge one dataframe at a time in a for loop, here is a way to do it that works whether or not the new dataframe has new company names or new columns:

import pandas as pd

df1 = pd.DataFrame({'Company Name':['A','B','C','D'],
                    'Data1':[1,34,23,66],'Data2':[13,54,5354,443]})
list_dfo = [pd.DataFrame({'Company Name':['A','B'],
                          'Address':  ['str1', 'str2'], 'Phone': ['str1a', 'str2a']}),
            pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']})]

for df_other in list_dfo:
    # outer merge on all shared columns, then keep the first non-null value per company
    df1 = pd.merge(df1,df_other,how='outer').groupby('Company Name').first().reset_index()
    # and other code

At the end of this example:

print(df1)
  Company Name  Data1   Data2 Address  Phone
0            A    1.0    13.0    str1  str1a
1            B   34.0    54.0    str2  str2a
2            C   23.0  5354.0    str3  str3a
3            D   66.0   443.0     NaN    NaN

Instead of first, you can use last, which keeps the last valid value in each column per group rather than the first. Which one to use depends on the data you need: the value already in df1, or the value from df_other when one is available. In the example above it makes no difference, but the following case shows the effect:

#company A has a new address
df4 = pd.DataFrame({'Company Name':['A'],'Address':['new_str1']})

#first keeps the value from df1
print(pd.merge(df1,df4,how='outer').groupby('Company Name')
        .first().reset_index())
Out[21]: 
  Company Name  Data1   Data2 Address  Phone
0            A    1.0    13.0    str1  str1a   #address is str1 from df1
1            B   34.0    54.0    str2  str2a
2            C   23.0  5354.0    str3  str3a
3            D   66.0   443.0     NaN    NaN

#while last keeps the value from df4
print (pd.merge(df1,df4,how='outer').groupby('Company Name')
         .last().reset_index())
Out[22]: 
  Company Name  Data1   Data2   Address  Phone
0            A    1.0    13.0  new_str1  str1a   #address is new_str1 from df4
1            B   34.0    54.0      str2  str2a
2            C   23.0  5354.0      str3  str3a
3            D   66.0   443.0       NaN    NaN
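
If it helps, the loop body can be wrapped in a small helper. The sketch below is a variant, not the exact code above: it stacks the two frames with pd.concat instead of pd.merge, so the row order that first/last sees is simply "rows of the existing frame, then rows of the new one". The name stack_and_fill and its keep parameter are made up for illustration.

```python
import pandas as pd


def stack_and_fill(base, other, key="Company Name", keep="first"):
    """Stack two frames and collapse them to one row per key (hypothetical helper).

    keep="first" prefers non-null values already in `base`;
    keep="last" prefers non-null values coming from `other`.
    """
    # concat places the rows of `base` before those of `other` and unions the columns
    stacked = pd.concat([base, other], ignore_index=True, sort=False)
    grouped = stacked.groupby(key, as_index=False)
    # first()/last() skip nulls, so gaps in one frame are filled from the other
    return grouped.first() if keep == "first" else grouped.last()


df1 = pd.DataFrame({"Company Name": ["A", "B", "C", "D"],
                    "Data1": [1, 34, 23, 66], "Data2": [13, 54, 5354, 443]})
others = [
    pd.DataFrame({"Company Name": ["A", "B"],
                  "Address": ["str1", "str2"], "Phone": ["str1a", "str2a"]}),
    pd.DataFrame({"Company Name": ["C"], "Address": ["str3"], "Phone": ["str3a"]}),
]

result = df1
for df_other in others:
    result = stack_and_fill(result, df_other)

# company A gets a new address: keep the old one or take the new one
df4 = pd.DataFrame({"Company Name": ["A"], "Address": ["new_str1"]})
kept = stack_and_fill(result, df4, keep="first")
replaced = stack_and_fill(result, df4, keep="last")
```

Like the merge version, this assumes one logical record per company, and integer columns may come back as floats once NaN has passed through them.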

0

IIUC, you might try this:

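def update_df(df1, df_next):
    # set the index on df1 only on the first call; later calls find it already set
    if 'Company Name' in df1.columns:
        df1.set_index('Company Name', inplace=True)
    # note: this also modifies df_next in place
    df_next.set_index('Company Name', inplace=True)
    # add the columns df1 lacks, filled with the column name as a placeholder
    new_cols = [col for col in df_next.columns if col not in df1.columns]
    for col in new_cols:
        df1[col] = col
    # overwrite df1 with the non-NA values of df_next, aligned on the index
    df1.update(df_next)

# df1 and df2 as first defined in the question; df3 is the single-row frame for company C
df3 = pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']})
update_df(df1, df2)
update_df(df1, df3)
df1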

              Data1  Data2  Address  Phone
Company Name                              
A                 1     13     str1  str1a
B                34     54     str2  str2a
C                23   5354     str3  str3a
D                66    443  Address  Phone

note1: to be able to use df.update you have to set the index to 'Company Name'. The function does this for df1 on the first call and skips it on later calls. The dataframe being added always gets its index set to 'Company Name'.

note2: next, the function checks whether there are new columns, adds them, and fills them with the column name (you might want to change that).

note3: lastly, df.update fills in the values you need.
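
If the placeholder values and the in-place changes are a problem, a non-mutating alternative can be sketched with DataFrame.combine_first, which fills nulls in the caller from the other frame and returns the union of rows and columns. This is a different technique from the update-based function above, and fill_from is a hypothetical name used only for illustration.

```python
import pandas as pd


def fill_from(base, other, key="Company Name"):
    """Return a new frame where `base` values win and gaps come from `other`.

    Hypothetical helper; neither input frame is modified.
    """
    left = base.set_index(key)    # set_index returns a copy by default
    right = other.set_index(key)
    # combine_first keeps non-null values of `left`, fills the rest from `right`,
    # and returns the union of both indexes and both column sets
    return left.combine_first(right).rename_axis(key).reset_index()


df1 = pd.DataFrame({"Company Name": ["A", "B", "C", "D"],
                    "Data1": [1, 34, 23, 66], "Data2": [13, 54, 5354, 443]})
df2 = pd.DataFrame({"Company Name": ["A", "B"],
                    "Address": ["str1", "str2"], "Phone": ["str1a", "str2a"]})
df3 = pd.DataFrame({"Company Name": ["C"], "Address": ["str3"], "Phone": ["str3a"]})

result = df1
for df_next in (df2, df3):
    result = fill_from(result, df_next)
```

Company D keeps NaN for Address and Phone here instead of a placeholder string. To let the incoming frame overwrite existing values, swap the operands (right.combine_first(left)). The column order of the result may differ from the input.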
