I have two data frames:

import pandas as pd

test1 = pd.DataFrame({'Gene': ['WASH7P', 'WASH7P', 'VCZ'], 'TPM': [10.034, 0.234000, 2.345]})
test2 = pd.DataFrame({'Gene': ['WASH7P', 'WASH7P', 'btt'], 'TPM': [1.12345, 2.300, 0.00000]})

I would like to merge them into a single data frame. I have tried:

df = pd.merge(test1,test2, on = ['Gene'],how = 'outer')

resulting in:

    Gene    TPM_x   TPM_y
0   WASH7P  10.034  1.12345
1   WASH7P  10.034  2.30000
2   WASH7P  0.234   1.12345
3   WASH7P  0.234   2.30000
4   VCZ     2.345   NaN
5   btt     NaN     0.00000

However, the WASH7P rows are repeated for every combination of TPM_x and TPM_y. I have tried drop_duplicates(), but it does not remove them. The real data frames are much larger, with more than 30,000 rows.

The desired output:

    Gene    TPM_x   TPM_y
    WASH7P  10.034  1.12345
    WASH7P  0.234   2.30000
    VCZ     2.345   NaN
    btt     NaN     0.00000

Any help would be great.

Comment (Feb 19, 2021 at 16:45): These aren't really duplicates: the values of TPM_x and TPM_y differ in the "duplicate" rows. You should try combine_first.
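
To make the point in this comment concrete, here is a small sketch using the question's data. It relies on the `validate` argument of `pd.merge`, which raises `pandas.errors.MergeError` when the join keys are not unique; the flag name `keys_are_unique` is made up for illustration.

```python
import pandas as pd

test1 = pd.DataFrame({'Gene': ['WASH7P', 'WASH7P', 'VCZ'], 'TPM': [10.034, 0.234, 2.345]})
test2 = pd.DataFrame({'Gene': ['WASH7P', 'WASH7P', 'btt'], 'TPM': [1.12345, 2.300, 0.0]})

# WASH7P appears twice in each frame, so merging on Gene pairs every
# left WASH7P row with every right WASH7P row: 2 x 2 = 4 rows.
merged = pd.merge(test1, test2, on='Gene', how='outer')
n_wash7p = int((merged['Gene'] == 'WASH7P').sum())

# Asking pandas to check for a one-to-one relationship surfaces the problem.
try:
    pd.merge(test1, test2, on='Gene', how='outer', validate='one_to_one')
    keys_are_unique = True
except pd.errors.MergeError as exc:
    keys_are_unique = False
    print(exc)
```

Since no two of the four WASH7P rows share the same (TPM_x, TPM_y) pair, a plain drop_duplicates() has nothing to remove.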

1 Answer

If you are trying to drop duplicates based on the column "TPM_x", chain drop_duplicates onto the merge:

df = pd.merge(test1,test2, on = ['Gene'],how = 'outer').drop_duplicates(keep="first", subset = 'TPM_x')

1 Comment

I found this gave the correct format, except that in TPM_y both WASH7P rows had 1.12345.
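
One possible way around the problem described in this comment, offered as a sketch rather than a definitive fix: it assumes the n-th occurrence of a gene in test1 should be paired with the n-th occurrence of that gene in test2, which is what the desired output in the question suggests. The helper column name `occurrence` is made up for illustration.

```python
import pandas as pd

test1 = pd.DataFrame({'Gene': ['WASH7P', 'WASH7P', 'VCZ'], 'TPM': [10.034, 0.234, 2.345]})
test2 = pd.DataFrame({'Gene': ['WASH7P', 'WASH7P', 'btt'], 'TPM': [1.12345, 2.300, 0.0]})

# Number repeated genes within each frame (0, 1, ...) so that a repeated
# gene gets a distinct key per occurrence. assign() returns a copy, so the
# original frames are left unchanged.
left = test1.assign(occurrence=test1.groupby('Gene').cumcount())
right = test2.assign(occurrence=test2.groupby('Gene').cumcount())

# Merge on the gene plus its occurrence number, then drop the helper column.
df = (
    pd.merge(left, right, on=['Gene', 'occurrence'], how='outer')
      .drop(columns='occurrence')
)
print(df)
```

Because each (Gene, occurrence) pair is unique within a frame, the merge no longer produces the 2 x 2 combinations for WASH7P. The row order of an outer merge can differ between pandas versions, so sort the result explicitly if the order matters.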
