Remove duplicates in dataframe pandas based on values of two columns

Question

I have a dataframe of customers with some items, which looks like this:

Customer ID     Item
     1         Banana
     1         Apple
     2         Orange
     3         Grape
     4         Banana
     4         Apple
     5         Orange
     5         Grape
     6         Orange

What I want to do is remove every customer whose items duplicate those of an earlier customer, so the result should look like this:

Customer ID     Item
     1         Banana
     1         Apple
     2         Orange
     3         Grape
     5         Orange
     5         Grape

Customer 4 is dropped because it has the same items as customer 1, and customer 6 because it has the same items as customer 2.

Thanks in advance for your help!

akuiper · Accepted Answer · 2017-03-29 14:35:30Z

I'm not sure this is exactly what you mean, but if you want to drop customers whose items duplicate an earlier customer's, you can collect the items for each customer as a frozenset (if each customer's items are unique) or as a tuple (if they are not), and then apply drop_duplicates. The index of the result holds the customer IDs to keep, which you can use to filter the original data frame.

df[df["Customer ID"].isin(df.groupby("Customer ID").Item.apply(frozenset).drop_duplicates().index)]
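To make the one-liner easier to follow, here is a minimal sketch that rebuilds the sample data from the question and runs the frozenset approach step by step. The variable names `item_sets`, `keep_ids`, and `result` are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 4, 4, 5, 5, 6],
    "Item": ["Banana", "Apple", "Orange", "Grape", "Banana",
             "Apple", "Orange", "Grape", "Orange"],
})

# One frozenset of items per customer; frozensets are hashable and ignore order.
item_sets = df.groupby("Customer ID")["Item"].apply(frozenset)

# Keep the first customer for each distinct item set; the index holds customer IDs.
keep_ids = item_sets.drop_duplicates().index

# Filter the original rows down to the surviving customers.
result = df[df["Customer ID"].isin(keep_ids)]
print(result)
# Customers 1, 2, 3 and 5 remain; 4 and 6 are dropped.
```

Because drop_duplicates keeps the first occurrence by default, the customer that appears first in the grouped result (the lowest ID here) is the one that survives.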

Or if items are not unique and order doesn't matter:

df[df["Customer ID"].isin(df.groupby("Customer ID")
                            .Item.apply(lambda x: tuple(sorted(x)))
                            .drop_duplicates().index)]
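The difference between the two variants only shows up when a customer can hold the same item more than once. The sketch below uses made-up data (not from the question) and a hypothetical helper `dedupe_customers` to compare them: a frozenset ignores repeat counts, while a sorted tuple keeps them.

```python
import pandas as pd

# Made-up data: customers 1 and 3 hold two Apples, customer 2 holds one.
df = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3],
    "Item": ["Apple", "Apple", "Apple", "Apple", "Apple"],
})

def dedupe_customers(frame, key_func):
    """Keep rows of the first customer for each distinct item key (illustrative helper)."""
    keys = frame.groupby("Customer ID")["Item"].apply(key_func)
    return frame[frame["Customer ID"].isin(keys.drop_duplicates().index)]

# frozenset: {Apple} for everyone, so customers 2 and 3 count as duplicates of 1.
by_set = dedupe_customers(df, frozenset)

# Sorted tuple: (Apple, Apple) differs from (Apple,), so only customer 3 is dropped.
by_tuple = dedupe_customers(df, lambda x: tuple(sorted(x)))

print(by_set["Customer ID"].unique())    # [1]
print(by_tuple["Customer ID"].unique())  # [1 2]
```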

edited Mar 29, 2017 at 14:35

answered Mar 29, 2017 at 14:30
