Merge Multiple Duplicate rows based on multiple columns in Pandas.Dataframe

Question

I need to identify duplicate rows in a DataFrame based on multiple columns. The remaining column (PKID, which holds integer values) should be merged into a list of integers. Example input data (rows 0 and 1 are duplicates except for the PKID column):

  Col1  PKID   SUBJECT ID
0  A    58305    ABC    X1
1  A    57011    ABC    X1
2  B    12345    XYZ    X1

Expected result:

  Col1   PKID            SUBJECT ID
0  A    [58305,57011]    ABC    X1
1  B    12345            XYZ    X1

So if rows are identical in every column except PKID, merge them into a single row, with the PKID values collected into a list of integers.

How can this be achieved?

Yes, except for the PKID column, if all other columns have the same values, then merge the rows into one and make the PKID column's value a list of integers.
– Shankar Pandey, Commented Feb 28, 2018 at 19:58

cs95 · Accepted Answer · 2018-02-28 20:02:20Z

You'll want groupby + apply: group on every column except PKID, then collect the unique PKID values within each group:

(df.groupby(df.columns.difference(['PKID']).tolist())
   .PKID.apply(pd.Series.unique)
   .reset_index())

  Col1  ID SUBJECT            PKID
0    A  X1     ABC  [58305, 57011]
1    B  X1     XYZ         [12345]
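
For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question (assuming SUBJECT and ID are two separate columns, as the answer's output suggests) and uses a slight variation on the snippet above: it converts each group's unique values to a plain Python list and passes sort=False so groups stay in order of first appearance. The names key_cols, merged and unwrapped are illustrative only.

```python
import pandas as pd

# Sample data from the question; SUBJECT and ID are assumed to be separate columns.
df = pd.DataFrame({
    "Col1": ["A", "A", "B"],
    "PKID": [58305, 57011, 12345],
    "SUBJECT": ["ABC", "ABC", "XYZ"],
    "ID": ["X1", "X1", "X1"],
})

# Group on every column except PKID, keeping the original column order.
key_cols = [c for c in df.columns if c != "PKID"]

# Collect the unique PKID values of each group into a plain Python list.
merged = (
    df.groupby(key_cols, sort=False)["PKID"]
      .apply(lambda s: s.unique().tolist())
      .reset_index()
)

# Optional: keep a bare integer when a group has only one PKID,
# which matches the "Expected result" shown in the question.
unwrapped = merged.assign(
    PKID=merged["PKID"].map(lambda v: v[0] if len(v) == 1 else v)
)

print(merged)
print(unwrapped)
```

Keeping every PKID cell as a list (merged) is usually easier to work with downstream than the mixed list/scalar column (unwrapped), since every cell then has the same type.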

