Merge dataframes Python without duplications

Question

I have two dataframes df1 and df2 and I want to merge them.

Dataframe df1 is as follows:

   IDs          Value1      Value2       
   AB              1          3
   AB              1          1
   AB              2          4           
   BC              2          2
   BC              5          0         
   BG              1          1         
   RF              2          2

and dataframe df2 is as follows:

   IDs          Issue     
   AB              AA
   AB              AAA
   AB              BA
   BC              CC
   BC              CA    
   BG              A        
   RF              D

and the desired output is df3:

   IDs          Value1      Value2        Issue     
   AB              1          3             AA
   AB              1          1             AAA
   AB              2          4             BA
   BC              2          2             CC
   BC              5          0             CA
   BG              1          1             A
   RF              2          2             D

Currently, the following:

df3 = pd.merge(df1,df2,left_on='IDs',right_on='IDs',how='inner')
df3 = pd.merge(df1,df2,left_on='IDs',right_on='IDs',how='left')
df3 = pd.merge(df1,df2,left_on='IDs',right_on='IDs',how='outer')

do not work, since they produce a result similar to the following:

   IDs          Value1      Value2        Issue     
   AB              1          3             AA
   AB              1          1             AA
   AB              2          4             AA
   BC              2          2             CC
   BC              5          0             CC
   BG              1          1             A
   RF              2          2             D

meaning that they duplicate the first value of the Issue field from df2.

jezrael · Accepted Answer · 2018-09-24 14:11:24Z

4

Use cumcount to create a counter column in both DataFrames, then add that column to the on parameter of merge:

df1['g'] = df1.groupby('IDs').cumcount()
df2['g'] = df2.groupby('IDs').cumcount()

df3 = pd.merge(df1,df2,on=['IDs', 'g']).drop('g', axis=1)
print (df3)
  IDs  Value1  Value2 Issue
0  AB       1       3    AA
1  AB       1       1   AAA
2  AB       2       4    BA
3  BC       2       2    CC
4  BC       5       0    CA
5  BG       1       1     A
6  RF       2       2     D

Details:

print (df1)
  IDs  Value1  Value2  g
0  AB       1       3  0
1  AB       1       1  1
2  AB       2       4  2
3  BC       2       2  0
4  BC       5       0  1
5  BG       1       1  0
6  RF       2       2  0

print (df2)
  IDs Issue  g
0  AB    AA  0
1  AB   AAA  1
2  AB    BA  2
3  BC    CC  0
4  BC    CA  1
5  BG     A  0
6  RF     D  0
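For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch built from the sample data in the question. It also shows what a merge on IDs alone does with repeated keys: every df1 row is paired with every df2 row that shares the same ID, so the row count grows. The names plain, left and right are only illustrative, and assign is used instead of writing the g column into the inputs; that is a stylistic choice, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "IDs": ["AB", "AB", "AB", "BC", "BC", "BG", "RF"],
    "Value1": [1, 1, 2, 2, 5, 1, 2],
    "Value2": [3, 1, 4, 2, 0, 1, 2],
})
df2 = pd.DataFrame({
    "IDs": ["AB", "AB", "AB", "BC", "BC", "BG", "RF"],
    "Issue": ["AA", "AAA", "BA", "CC", "CA", "A", "D"],
})

# Merging on IDs alone is many-to-many: each AB row in df1 matches
# all three AB rows in df2, and likewise for BC.
plain = pd.merge(df1, df2, on="IDs")
print(len(plain))  # 3*3 + 2*2 + 1 + 1 = 15 rows

# Number the rows within each IDs group (0, 1, 2, ...) so that the pair
# (IDs, g) is unique on both sides, then merge on both columns.
left = df1.assign(g=df1.groupby("IDs").cumcount())
right = df2.assign(g=df2.groupby("IDs").cumcount())
df3 = left.merge(right, on=["IDs", "g"]).drop(columns="g")
print(df3)
```

This pairs the n-th occurrence of an ID in df1 with the n-th occurrence in df2, so it only gives a meaningful result when the row order within each ID carries that meaning in both frames.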

edited Sep 24, 2018 at 14:11

answered Sep 24, 2018 at 13:19

jezrael


9 Comments

user37143 Over a year ago

This solution does not seem to work. I keep getting exactly the same issue described in my question above.

jezrael Over a year ago

@user37143 - Really interesting; for me it works very well.

jezrael Over a year ago

@user37143 - I added the output of column g. For duplicated IDs it holds incrementing integers. Can you check it?

user37143 Over a year ago

@jezrael Yes, that's my output as well, but the join still messes things up, returning the first occurrence of "Issue" duplicated.

jezrael Over a year ago

@user37143 - It should work. A possible problem is stray whitespace in the IDs column, or the IDs columns not having the same type in both DataFrames. Thank you.
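If whitespace or mismatched types in the key column are the suspected cause, one way to rule them out is to normalize the keys before building the counter. The sketch below uses made-up data with stray spaces; clean_key is a hypothetical helper, not a pandas function, and validate="one_to_one" is an optional check that makes merge raise if (IDs, g) is not unique on either side.

```python
import pandas as pd

# Hypothetical data: the left keys carry stray whitespace.
left = pd.DataFrame({"IDs": ["AB ", "AB", " BC"], "Value1": [1, 2, 3]})
right = pd.DataFrame({"IDs": ["AB", "AB", "BC"], "Issue": ["AA", "AAA", "CC"]})

def clean_key(s):
    # Cast to str so both sides share a dtype, then trim whitespace.
    # Caveat: astype(str) turns missing values into the string "nan".
    return s.astype(str).str.strip()

left["IDs"] = clean_key(left["IDs"])
right["IDs"] = clean_key(right["IDs"])

# Build the per-group counter only after the keys are cleaned.
left["g"] = left.groupby("IDs").cumcount()
right["g"] = right.groupby("IDs").cumcount()

merged = left.merge(
    right, on=["IDs", "g"], how="left", validate="one_to_one"
).drop(columns="g")
print(merged)
```

Comparing df1['IDs'].dtype with df2['IDs'].dtype and looking at df1['IDs'].unique() before merging is a quick way to see whether this cleanup is needed at all.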


Scott Boston · Accepted Answer · 2018-09-24 13:26:08Z

2

You can use pd.concat to join the dataframes purely by their index. This means both dataframes have to be in the same row order already, with matching indexes, because you are simply "pasting" one dataframe next to the other.

pd.concat([df1, df2[['Issue']]], axis=1)

Output:

  IDs  Value1  Value2 Issue
0  AB       1       3    AA
1  AB       1       1   AAA
2  AB       2       4    BA
3  BC       2       2    CC
4  BC       5       0    CA
5  BG       1       1     A
6  RF       2       2     D

edited Sep 24, 2018 at 13:26

answered Sep 24, 2018 at 13:25

Scott Boston
