Pandas dataframe join using multiple columns

Question

I am doing a dataframe outer join using multiple columns:

DF1:

ColumnA ColumnB ColumnC ColumnD
1          2      3        4
1          2      3        4

DF2:

ColumnE ColumnF ColumnG ColumnH
1          2      3        4
1          2      3        4

Merging code:

df= pd.merge(DF1, DF2, left_on=['ColumnA','ColumnB','ColumnC','ColumnD'], right_on=['ColumnE','ColumnF','ColumnG','ColumnH'], how='outer')

Actual outcome:

ColumnA ColumnB ColumnC ColumnD ColumnE ColumnF ColumnG ColumnH
1        2       3       4         1      2       3       4
1        2       3       4         1      2       3       4
1        2       3       4         1      2       3       4
1        2       3       4         1      2       3       4

Expected outcome (the values should appear only twice, since the combination of columns matches exactly in the two datasets):

ColumnA ColumnB ColumnC ColumnD ColumnE ColumnF ColumnG ColumnH
1        2       3       4         1      2       3       4
1        2       3       4         1      2       3       4

Can someone advise where I am going wrong?

This happens because there are duplicate values in each column. If your data frame had 3 rows instead of 2, then 9 rows would appear instead of 4. Please check my answer :)
– ansev, Commented Nov 19, 2019 at 1:10
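The row counts mentioned in the comment above can be checked with a small sketch. The helper names `make_frame` and `merged_row_count` are hypothetical and only serve this illustration; the data mirrors the question's identical rows.

```python
import pandas as pd

LEFT_COLS = ["ColumnA", "ColumnB", "ColumnC", "ColumnD"]
RIGHT_COLS = ["ColumnE", "ColumnF", "ColumnG", "ColumnH"]


def make_frame(columns, n_rows):
    # Hypothetical helper: n_rows identical rows holding 1, 2, 3, 4.
    return pd.DataFrame([[1, 2, 3, 4]] * n_rows, columns=columns)


def merged_row_count(n_rows):
    # Outer-merge two frames that each contain n_rows identical rows.
    df1 = make_frame(LEFT_COLS, n_rows)
    df2 = make_frame(RIGHT_COLS, n_rows)
    merged = pd.merge(df1, df2, left_on=LEFT_COLS, right_on=RIGHT_COLS, how="outer")
    return len(merged)


print(merged_row_count(2))  # 4: each of 2 left rows pairs with each of 2 right rows
print(merged_row_count(3))  # 9: 3 x 3 pairings
```

Every left row whose key matches is paired with every right row carrying the same key, so the count is the product of the duplicate counts on each side.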

Andy L. · Accepted Answer · 2019-11-19 01:54:48Z

2

You have identical duplicate rows in both df1 and df2, so the merge pairs every duplicate in df1 with every matching duplicate in df2 and the row count multiplies. A simple solution is to make one dataframe unique with drop_duplicates and then merge:

df = pd.merge(df1.drop_duplicates(), df2, left_on=['ColumnA','ColumnB','ColumnC','ColumnD'], right_on=['ColumnE','ColumnF','ColumnG','ColumnH'], how='outer')

Out[742]:
   ColumnA  ColumnB  ColumnC  ColumnD  ColumnE  ColumnF  ColumnG  ColumnH
0        1        2        3        4        1        2        3        4
1        1        2        3        4        1        2        3        4
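For anyone who wants to run this end to end, here is a self-contained sketch of the same idea using the question's data. The `validate="one_to_many"` argument is an addition that is not part of the answer above: it asks pandas to raise a `MergeError` if the left keys are not unique, which guards against the row multiplication coming back unnoticed.

```python
import pandas as pd

left_cols = ["ColumnA", "ColumnB", "ColumnC", "ColumnD"]
right_cols = ["ColumnE", "ColumnF", "ColumnG", "ColumnH"]

# Two identical rows on each side, as in the question.
df1 = pd.DataFrame([[1, 2, 3, 4]] * 2, columns=left_cols)
df2 = pd.DataFrame([[1, 2, 3, 4]] * 2, columns=right_cols)

# Keep a single copy of each distinct row on the left side.
deduped = df1.drop_duplicates()

result = pd.merge(
    deduped,
    df2,
    left_on=left_cols,
    right_on=right_cols,
    how="outer",
    validate="one_to_many",  # assumption: added here as a safety check
)

print(len(deduped))  # 1
print(len(result))   # 2
```

Note that this discards how many times each row occurred in df1, so the result has as many rows as df2 has matches. If the per-side duplicate counts matter, a positional key is the safer route.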


BENY · Accepted Answer · 2019-11-19 01:49:56Z

2

So we need to merge on an additional key, created with cumcount:

# Number each repeated row within its group of identical rows: 0, 1, 2, ...
df1 = df1.assign(Key=df1.groupby(list(df1)).cumcount())
df2 = df2.assign(Key=df2.groupby(list(df2)).cumcount())

df1.merge(df2, left_on=['ColumnA','ColumnB','ColumnC','ColumnD','Key'],
               right_on=['ColumnE','ColumnF','ColumnG','ColumnH','Key'], how='outer')
Out[19]: 
   ColumnA  ColumnB  ColumnC  ColumnD  Key  ColumnE  ColumnF  ColumnG  ColumnH
0        1        2        3        4    0        1        2        3        4
1        1        2        3        4    1        1        2        3        4
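A runnable sketch of the cumcount approach follows, with each frame numbered by its own groupby. The helper `add_occurrence_key` is a hypothetical name introduced for this example, and the frames deliberately have different duplicate counts (3 rows versus 2) to show what the outer join does with the unmatched occurrence.

```python
import pandas as pd

left_cols = ["ColumnA", "ColumnB", "ColumnC", "ColumnD"]
right_cols = ["ColumnE", "ColumnF", "ColumnG", "ColumnH"]


def add_occurrence_key(df, name="Key"):
    # Hypothetical helper: number each repeated row 0, 1, 2, ... within
    # its group of identical rows, and store that number in a new column.
    return df.assign(**{name: df.groupby(list(df.columns)).cumcount()})


# Assumed sample data: three identical rows on the left, two on the right.
df1 = add_occurrence_key(pd.DataFrame([[1, 2, 3, 4]] * 3, columns=left_cols))
df2 = add_occurrence_key(pd.DataFrame([[1, 2, 3, 4]] * 2, columns=right_cols))

result = df1.merge(
    df2,
    left_on=left_cols + ["Key"],
    right_on=right_cols + ["Key"],
    how="outer",
)

print(len(result))                    # 3
print(result["ColumnE"].isna().sum()) # 1: the third left row has no partner
```

Because the n-th occurrence on the left only matches the n-th occurrence on the right, duplicates pair up one to one instead of multiplying, and any surplus occurrence shows up with missing values on the other side.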


Andy L. Over a year ago

I never thought about adding an additional key :) +1
