pandas merging two dataframes without duplicate rows

Question

I have two data frames:

import pandas as pd

test1 = pd.DataFrame({'Gene': ['WASH7P', 'WASH7P', 'VCZ'], 'TPM': [10.034, 0.234000, 2.345]})
test2 = pd.DataFrame({'Gene': ['WASH7P', 'WASH7P', 'btt'], 'TPM': [1.12345, 2.300, 0.00000]})

I would like to merge them into a single data frame. I have tried:

df = pd.merge(test1, test2, on='Gene', how='outer')

resulting in:

    Gene    TPM_x   TPM_y
0   WASH7P  10.034  1.12345
1   WASH7P  10.034  2.30000
2   WASH7P  0.234   1.12345
3   WASH7P  0.234   2.30000
4   VCZ     2.345   NaN
5   btt     NaN     0.00000

However, the WASH7P rows are repeated. I have tried drop_duplicates(), but it does not remove them. The real data frames are much larger, with more than 30,000 rows.

The desired output:

    Gene    TPM_x   TPM_y
    WASH7P  10.034  1.12345
    WASH7P  0.234   2.30000
    VCZ     2.345   NaN
    btt     NaN     0.00000

Any help would be great.

These aren't really duplicates: the values of TPM_x and TPM_y differ in the "duplicate" rows. You should try combine_first.
– forgetso, Commented Feb 19, 2021 at 16:45
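
To illustrate the point in this comment, here is a small check on the example data from the question. It is only a sketch: it shows that no row of the merged frame is an exact copy of another, which is why a plain drop_duplicates() leaves all six rows in place. The four WASH7P rows come from the join pairing every WASH7P row on the left with every WASH7P row on the right.

```python
import pandas as pd

test1 = pd.DataFrame({'Gene': ['WASH7P', 'WASH7P', 'VCZ'], 'TPM': [10.034, 0.234, 2.345]})
test2 = pd.DataFrame({'Gene': ['WASH7P', 'WASH7P', 'btt'], 'TPM': [1.12345, 2.3, 0.0]})

df = pd.merge(test1, test2, on='Gene', how='outer')

# No row is an exact duplicate of another, so drop_duplicates() removes nothing.
print(df.duplicated().any())           # False
print(len(df.drop_duplicates()))       # 6

# WASH7P occurs twice in each frame, so the join produces 2 * 2 = 4 rows for it.
print((df['Gene'] == 'WASH7P').sum())  # 4
```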

rogercake · Accepted Answer · 2021-02-19 16:47:45Z

If you want to drop duplicates based on the TPM_x column, use this:

df = pd.merge(test1, test2, on='Gene', how='outer').drop_duplicates(subset='TPM_x', keep='first')

1 Comment

Chip Over a year ago

I found this gave the correct format, except that TPM_y was 1.12345 for both WASH7P rows.
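
One possible way to get the desired output from the question, which is not what the accepted answer does, is to pair repeated genes by their order of appearance. The sketch below assumes that the first WASH7P row in test1 should line up with the first WASH7P row in test2, the second with the second, and so on. The helper column name occ is made up for this example.

```python
import pandas as pd

test1 = pd.DataFrame({'Gene': ['WASH7P', 'WASH7P', 'VCZ'], 'TPM': [10.034, 0.234, 2.345]})
test2 = pd.DataFrame({'Gene': ['WASH7P', 'WASH7P', 'btt'], 'TPM': [1.12345, 2.3, 0.0]})

# 'occ' (hypothetical helper column) numbers each occurrence of a gene
# within its own frame: 0 for the first WASH7P, 1 for the second, etc.
left = test1.assign(occ=test1.groupby('Gene').cumcount())
right = test2.assign(occ=test2.groupby('Gene').cumcount())

# Merging on both Gene and occ pairs the n-th occurrence on the left with
# the n-th occurrence on the right instead of every left/right combination.
df = (
    pd.merge(left, right, on=['Gene', 'occ'], how='outer')
    .drop(columns='occ')
)
print(df)
```

The row order of an outer merge can differ between pandas versions, so sort the result explicitly if the order matters. If the two frames do not list repeated genes in a matching order, this pairing is arbitrary and a real key column would be needed instead.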
