
I am doing a dataframe outer join using multiple columns:

DF1:

ColumnA ColumnB ColumnC ColumnD
1          2      3        4
1          2      3        4

DF2:

ColumnE ColumnF ColumnG ColumnH
1          2      3        4
1          2      3        4

Merging code:

df = pd.merge(DF1, DF2, left_on=['ColumnA','ColumnB','ColumnC','ColumnD'], right_on=['ColumnE','ColumnF','ColumnG','ColumnH'], how='outer')

Actual outcome:

ColumnA ColumnB ColumnC ColumnD ColumnE ColumnF ColumnG ColumnH
1        2       3       4         1      2       3       4
1        2       3       4         1      2       3       4
1        2       3       4         1      2       3       4
1        2       3       4         1      2       3       4

Expected outcome (each row should appear only twice, since the combination of columns matches exactly in both datasets):

ColumnA ColumnB ColumnC ColumnD ColumnE ColumnF ColumnG ColumnH
1        2       3       4         1      2       3       4
1        2       3       4         1      2       3       4

Can someone advise where I am going wrong?

  • This happens because the key columns contain duplicate rows in both data frames, so every matching row on the left pairs with every matching row on the right. If your data frames had 3 rows each instead of 2, 9 rows would appear instead of 4. Please check my answer :) Commented Nov 19, 2019 at 1:10
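
The row multiplication described in this comment can be checked with a small sketch. The helper `merged_row_count` is a hypothetical name introduced here for illustration, and the frames are rebuilt from the sample values in the question:

```python
import pandas as pd

left_keys = ["ColumnA", "ColumnB", "ColumnC", "ColumnD"]
right_keys = ["ColumnE", "ColumnF", "ColumnG", "ColumnH"]


def merged_row_count(n_left, n_right):
    """Outer-merge n_left and n_right identical rows and return the row count."""
    df1 = pd.DataFrame([[1, 2, 3, 4]] * n_left, columns=left_keys)
    df2 = pd.DataFrame([[1, 2, 3, 4]] * n_right, columns=right_keys)
    merged = pd.merge(df1, df2, left_on=left_keys, right_on=right_keys, how="outer")
    return len(merged)


# Every left duplicate pairs with every right duplicate,
# so the result has n_left * n_right rows.
count_2x2 = merged_row_count(2, 2)
count_3x3 = merged_row_count(3, 3)
count_3x2 = merged_row_count(3, 2)
print(count_2x2, count_3x3, count_3x2)  # 4 9 6
```

The counts are a product rather than a sum because a merge on duplicated keys is many-to-many.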

2 Answers

2

Both df1 and df2 contain identical duplicate rows, so each duplicate in df1 matches every duplicate in df2 and the merge returns 2 × 2 = 4 rows. A simple solution is to make one DataFrame unique with drop_duplicates and then merge:

df = pd.merge(df1.drop_duplicates(), df2, left_on=['ColumnA','ColumnB','ColumnC','ColumnD'], right_on=['ColumnE','ColumnF','ColumnG','ColumnH'], how='outer')

Out[742]:
   ColumnA  ColumnB  ColumnC  ColumnD  ColumnE  ColumnF  ColumnG  ColumnH
0        1        2        3        4        1        2        3        4
1        1        2        3        4        1        2        3        4
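
As a hedged addition to this answer: `pd.merge` accepts a `validate` argument that raises `pandas.errors.MergeError` when the keys are not unique on the side you expect, which makes accidental many-to-many merges visible. The sketch below assumes the sample frames from the question; the variable `duplicate_keys_detected` is introduced only for illustration.

```python
import pandas as pd
from pandas.errors import MergeError

left_keys = ["ColumnA", "ColumnB", "ColumnC", "ColumnD"]
right_keys = ["ColumnE", "ColumnF", "ColumnG", "ColumnH"]

df1 = pd.DataFrame([[1, 2, 3, 4]] * 2, columns=left_keys)
df2 = pd.DataFrame([[1, 2, 3, 4]] * 2, columns=right_keys)

# validate="one_to_many" requires the merge keys to be unique in the left frame.
try:
    pd.merge(df1, df2, left_on=left_keys, right_on=right_keys,
             how="outer", validate="one_to_many")
    duplicate_keys_detected = False
except MergeError:
    duplicate_keys_detected = True

# After de-duplicating the left frame the same check passes,
# and the result has one row per row of df2.
deduped = pd.merge(df1.drop_duplicates(), df2, left_on=left_keys,
                   right_on=right_keys, how="outer", validate="one_to_many")
print(duplicate_keys_detected, len(deduped))
```

Note that `drop_duplicates()` without arguments compares all columns. If the left frame carried extra non-key columns, passing `subset=left_keys` would probably be needed, at the cost of discarding the differing values.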

2

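So we need to merge on an additional key, created with cumcount. It numbers the duplicates within each group (0, 1, ...), so the first duplicate in df1 pairs only with the first duplicate in df2, the second with the second, and so on: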

# Number the duplicate rows within each frame separately
df1 = df1.assign(Key=df1.groupby(list(df1)).cumcount())
df2 = df2.assign(Key=df2.groupby(list(df2)).cumcount())

df1.merge(df2, left_on=['ColumnA','ColumnB','ColumnC','ColumnD','Key'],
               right_on=['ColumnE','ColumnF','ColumnG','ColumnH','Key'], how='outer')
Out[19]: 
   ColumnA  ColumnB  ColumnC  ColumnD  Key  ColumnE  ColumnF  ColumnG  ColumnH
0        1        2        3        4    0        1        2        3        4
1        1        2        3        4    1        1        2        3        4
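
A self-contained sketch of the same idea, assuming the two frames hold different numbers of duplicates (three in df1, two in df2, a case not covered in the question). With an outer join, the unmatched third duplicate should survive with missing values on the right-hand side:

```python
import pandas as pd

left_keys = ["ColumnA", "ColumnB", "ColumnC", "ColumnD"]
right_keys = ["ColumnE", "ColumnF", "ColumnG", "ColumnH"]

df1 = pd.DataFrame([[1, 2, 3, 4]] * 3, columns=left_keys)   # three duplicates
df2 = pd.DataFrame([[1, 2, 3, 4]] * 2, columns=right_keys)  # two duplicates

# cumcount numbers the rows inside each group of identical keys: 0, 1, 2, ...
df1 = df1.assign(Key=df1.groupby(left_keys).cumcount())
df2 = df2.assign(Key=df2.groupby(right_keys).cumcount())

# Merging on the original keys plus the counter pairs the n-th duplicate
# on the left with the n-th duplicate on the right, not with all of them.
merged = df1.merge(df2, left_on=left_keys + ["Key"],
                   right_on=right_keys + ["Key"], how="outer")

unmatched = int(merged["ColumnE"].isna().sum())
print(len(merged), unmatched)
```

With `how='inner'` the unmatched duplicate would be dropped instead, so the choice of join type decides whether surplus duplicates are kept.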

1 Comment

I never thought about adding an additional key :) +1
