2

I'm merging 5 files. Each file has certain column names which are same across the other files. I used this answer to perform the concat operation. There is one column which is unique in all files i.e ID. When I merge the columns I'm noticing the suffix(_x, _y) being applied to end of the column names(due to merge functionality) which are repeating. How can I merge these columns into a single column (Irrespective of null values).

eg:

Dataframe 1
ID    Name    Age    LAN_NBR
1     ABC     24     G284992

Dataframe 2
ID    Name    Street     City     State    TYPE
2     John    Wacker     Chicago  IL       HUB

Dataframe 3
ID    CLOSE_DATE    TYPE
3     1/1/2021      HUB

Dataframe 4
ID    TYPE    LAN_NBR
1     HUB     G284992

Expected Output

ID    Name    Age    LAN_NBR    Street     City     State    TYPE    CLOSE_DATE
1     ABC     24     G284992                                 HUB
2     John                      Wacker     Chicago  IL       HUB
3                                                            HUB     1/1/2021

Code

obj1=pd.read_excel("file1.xlsx")
obj2=pd.read_excel("file2.xlsx")
obj3=pd.read_excel("file3.xlsx")
obj4=pd.read_excel("file4.xlsx")
obj5=pd.read_excel("file5.xlsx")

obj1_ID=pd.DataFrame(obj1["ID"])
obj2_ID=pd.DataFrame(obj2["ID"])
obj3_ID=pd.DataFrame(obj3["ID"])
obj4_ID=pd.DataFrame(obj4["ID"])
obj5_ID=pd.DataFrame(obj5["ID"])

concat_pd=[obj1_ID,obj2_ID,obj3_ID,obj4_ID,obj5_ID]
obj_final=pd.concat(concat_pd).fillna('')

obj_final.obj_final.drop_duplicates(subset='ID',keep='first')

merge1=pd.merge(left=obj_final, right=obj1, on="ID", how="left")
merge1=pd.merge(left=obj_final, right=obj2, on="ID", how="left")
merge1=pd.merge(left=obj_final, right=obj3, on="ID", how="left")
merge1=pd.merge(left=obj_final, right=obj4, on="ID", how="left")
merge1=pd.merge(left=obj_final, right=obj5, on="ID", how="left")
1
  • You could set suffix = (None, "_drop") and then reindex the final dataframe to the union of your other dataframe column names. Commented Apr 13, 2020 at 20:46

2 Answers 2

2

Option 1: concat and drop_duplicates:

(pd.concat( (df1, df2, df3, df4), sort=False)
   .groupby('ID')
   .bfill()
   .drop_duplicates('ID')
   .reset_index(drop=True)
)

Output:

   ID  Name   Age  LAN_NBR  Street     City State TYPE CLOSE_DATE
0   1   ABC  24.0  G284992     NaN      NaN   NaN  HUB        NaN
1   2  John   NaN      NaN  Wacker  Chicago    IL  HUB        NaN
2   3   NaN   NaN      NaN     NaN      NaN   NaN  HUB   1/1/2021

Option 2: use combine_first as @ansev suggested:

ret_df = pd.DataFrame()

for d in (df1, df2, df3, df4):
    ret_df = ret_df.combine_first(d.set_index('ID') )

ret_df = ret_df.reset_index() 

Output:

   ID   Age CLOSE_DATE     City  LAN_NBR  Name State  Street TYPE
0   1  24.0        NaN      NaN  G284992   ABC   NaN     NaN  HUB
1   2   NaN        NaN  Chicago      NaN  John    IL  Wacker  HUB
2   3   NaN   1/1/2021      NaN      NaN   NaN   NaN     NaN  HUB
Sign up to request clarification or add additional context in comments.

Comments

2

you can use concat and than groupby with first to keep the first non null value available if any per ID.

print (pd.concat([df1, df2, df3, df4]).groupby('ID').first())
    Name   Age  LAN_NBR  Street     City State TYPE CLOSE_DATE
ID                                                            
1    ABC  24.0  G284992     NaN      NaN   NaN  HUB        NaN
2   John   NaN      NaN  Wacker  Chicago    IL  HUB        NaN
3    NaN   NaN      NaN     NaN      NaN   NaN  HUB   1/1/2021

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.