I need to identify duplicate rows based on multiple columns in a DataFrame. The values of the remaining column (PKID, which holds integers) should be merged into a list of integers. Example input (rows 0 and 1 are duplicates except for the PKID column):

  Col1   PKID SUBJECT  ID
0    A  58305     ABC  X1
1    A  57011     ABC  X1
2    B  12345     XYZ  X1

Expected result:

  Col1            PKID SUBJECT  ID
0    A  [58305, 57011]     ABC  X1
1    B           12345     XYZ  X1

So if rows match on every column except PKID, they should be merged into a single row whose PKID value is a list of integers.

How can this be achieved?

  • Do you know what column values can be duplicated? Commented Feb 28, 2018 at 19:53
  • Yes. If all columns other than PKID have the same values, merge the rows into one and make the PKID column's value a list of integers. Commented Feb 28, 2018 at 19:58

1 Answer

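You'll want a groupby + apply. Group on every column except PKID, then collect the unique PKID values of each group (each resulting cell is a NumPy array, since that is what pd.Series.unique returns):

import pandas as pd

df.groupby(df.columns.difference(['PKID']).tolist())\
  .PKID.apply(pd.Series.unique).reset_index()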

  Col1  ID SUBJECT            PKID
0    A  X1     ABC  [58305, 57011]
1    B  X1     XYZ         [12345]
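
If plain Python lists are preferred over NumPy arrays, a variant along the following lines may work. This is only a sketch: the sample frame is rebuilt from the question's input, and the names `key_cols` and `merged` are illustrative, not part of the original answer. Unlike `unique`, `agg(list)` keeps repeated PKID values if a group contains any.

```python
import pandas as pd

# Sample data reconstructed from the question's input table.
df = pd.DataFrame({
    "Col1": ["A", "A", "B"],
    "PKID": [58305, 57011, 12345],
    "SUBJECT": ["ABC", "ABC", "XYZ"],
    "ID": ["X1", "X1", "X1"],
})

# Every column except PKID defines what counts as a duplicate row.
key_cols = [c for c in df.columns if c != "PKID"]

# Collect the PKIDs of each group into a Python list.
# sort=False keeps groups in order of first appearance.
merged = (
    df.groupby(key_cols, sort=False)["PKID"]
      .agg(list)
      .reset_index()
)

# Restore the original column order (groupby moves PKID to the end).
merged = merged[df.columns.tolist()]
print(merged)
```

Note that `groupby` drops rows whose key columns contain NaN by default; passing `dropna=False` (available in pandas 1.1 and later) keeps them.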