Filter Pandas DataFrame using value_counts and multiple columns?

Question

I have a dataset of orders and people who have placed those orders. Orders have a unique identifier, and buyers have a unique identifier across multiple orders. Here's an example of that dataset:

| Order_ID | Order_Date | Buyer_ID |
|----------|------------|----------|
| 123421   | 01/01/19   | a213422  |
| 123421   | 01/01/19   | a213422  |
| 123421   | 01/01/19   | a213422  |
| 346345   | 01/03/19   | a213422  |
| 567868   | 01/05/19   | a346556  |
| 567868   | 01/05/19   | a346556  |
| 234534   | 01/10/19   | a678909  |

I want to filter the dataset down to buyers who have placed only one order, even if that order contains multiple items:

| Order_ID | Order_Date | Buyer_ID |
|----------|------------|----------|
| 567868   | 01/05/19   | a346556  |
| 567868   | 01/05/19   | a346556  |
| 234534   | 01/10/19   | a678909  |

If I try df[df['Buyer_ID'].map(df['Buyer_ID'].value_counts()) == 1], I get an unexpected result: the resulting DataFrame contains only rows where there is a one-to-one relationship between Order_ID and Buyer_ID. Like this:

| Order_ID | Order_Date | Buyer_ID |
|----------|------------|----------|
| 346345   | 01/03/19   | a213422  |
| 234534   | 01/10/19   | a678909  |

In the result I want, Buyer_ID a213422 should not appear at all because that person has more than one Order_ID.

This leads me to believe that value_counts() is either not the appropriate way to perform this filter, or I'm doing it wrong. What would be the appropriate way to perform this filter?

They are just different Order_IDs, with a one-to-many relationship from Order_ID to items ordered. They represent a single order.
– kabaname, Commented Dec 5, 2019 at 20:57

ansev · Accepted Answer · 2019-12-06 13:12:59Z

Method 1: boolean indexing with groupby.transform

df[df.groupby('Buyer_ID')['Order_ID'].transform('nunique').eq(1)]

Method 2: Groupby.filter

df.groupby('Buyer_ID').filter(lambda x: x['Order_ID'].nunique()==1)

Method 3: boolean indexing with Series.map

df[df['Buyer_ID'].map(df.groupby('Buyer_ID')['Order_ID'].nunique().eq(1))]

Output

   Order_ID Order_Date Buyer_ID
4    567868   01/05/19  a346556
5    567868   01/05/19  a346556
6    234534   01/10/19  a678909
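
For anyone who wants to run the three methods end to end, here is a minimal sketch. It rebuilds the question's sample table by hand (the df construction and the variable names are assumptions for illustration, not part of the answer). It also prints the two counts side by side: value_counts() on Buyer_ID counts rows per buyer, whereas nunique() on Order_ID counts distinct orders per buyer, which is what the filter needs.

```python
import pandas as pd

# Sample data copied from the question (construction is illustrative).
df = pd.DataFrame({
    "Order_ID": [123421, 123421, 123421, 346345, 567868, 567868, 234534],
    "Order_Date": ["01/01/19", "01/01/19", "01/01/19", "01/03/19",
                   "01/05/19", "01/05/19", "01/10/19"],
    "Buyer_ID": ["a213422", "a213422", "a213422", "a213422",
                 "a346556", "a346556", "a678909"],
})

# value_counts() counts rows (ordered items) per buyer, not orders.
rows_per_buyer = df["Buyer_ID"].value_counts()

# nunique() counts distinct Order_ID values per buyer.
orders_per_buyer = df.groupby("Buyer_ID")["Order_ID"].nunique()

# Method 1: broadcast the per-buyer count back onto every row.
method1 = df[df.groupby("Buyer_ID")["Order_ID"].transform("nunique").eq(1)]

# Method 2: keep or drop whole groups.
method2 = df.groupby("Buyer_ID").filter(lambda g: g["Order_ID"].nunique() == 1)

# Method 3: map each Buyer_ID to a per-buyer boolean.
method3 = df[df["Buyer_ID"].map(orders_per_buyer.eq(1))]

print(rows_per_buyer.to_dict())
print(orders_per_buyer.to_dict())
print(method1)
```

On this sample all three methods select the same rows; which one reads best is mostly a matter of taste, though transform avoids calling a Python lambda once per group.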

If you also want to remove the repeated item rows so that each order appears only once, chain DataFrame.drop_duplicates at the end:

df[df.groupby('Buyer_ID')['Order_ID'].transform('nunique').eq(1)].drop_duplicates()


   Order_ID Order_Date Buyer_ID
4    567868   01/05/19  a346556
6    234534   01/10/19  a678909
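
If you would rather stay close to the value_counts() idea from the question, one possible variation (not one of the methods in this answer) is to first reduce the data to one row per (Buyer_ID, Order_ID) pair, so that counting rows becomes the same as counting orders. The sketch below rebuilds the question's sample data; the names pairs and single_order_rows are illustrative.

```python
import pandas as pd

# Sample data copied from the question (construction is illustrative).
df = pd.DataFrame({
    "Order_ID": [123421, 123421, 123421, 346345, 567868, 567868, 234534],
    "Order_Date": ["01/01/19", "01/01/19", "01/01/19", "01/03/19",
                   "01/05/19", "01/05/19", "01/10/19"],
    "Buyer_ID": ["a213422", "a213422", "a213422", "a213422",
                 "a346556", "a346556", "a678909"],
})

# One row per distinct (buyer, order) pair, so each order is counted once.
pairs = df[["Buyer_ID", "Order_ID"]].drop_duplicates()

# On the de-duplicated pairs, value_counts() gives orders per buyer.
orders_per_buyer = pairs["Buyer_ID"].value_counts()

# Keep every row (all items) for buyers with exactly one order.
single_order_rows = df[df["Buyer_ID"].map(orders_per_buyer) == 1]

print(single_order_rows)
```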

edited Dec 6, 2019 at 13:12

answered Dec 5, 2019 at 21:02

ansev

ansev Over a year ago

I added drop_duplicates to complete my solution

Dylon · Accepted Answer · 2019-12-05 21:18:26Z

Here's another way you could do it:

import pandas as pd

# | Order_ID | Order_Date | Buyer_ID |
# |----------|------------|----------|
# | 123421   | 01/01/19   | a213422  |
# | 123421   | 01/01/19   | a213422  |
# | 123421   | 01/01/19   | a213422  |
# | 346345   | 01/03/19   | a213422  |
# | 567868   | 01/05/19   | a346556  |
# | 567868   | 01/05/19   | a346556  |
# | 234534   | 01/10/19   | a678909  |

df = pd.DataFrame.from_dict({
    "Order_ID": [123421, 123421, 123421, 346345, 567868, 567868, 234534],
    "Order_Date": ["01/01/19", "01/01/19", "01/01/19", "01/03/19", "01/05/19", "01/05/19", "01/10/19"],
    "Buyer_ID": ["a213422", "a213422", "a213422", "a213422", "a346556", "a346556", "a678909"],
})

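# Collect the IDs of buyers whose rows contain exactly one distinct Order_ID.
buyers_with_one_order = (
    df.groupby("Buyer_ID")
      .agg(num_orders=("Order_ID", "nunique"))
      .query("num_orders == 1")
      .reset_index()
      .Buyer_ID
)

# The inner merge on Buyer_ID keeps only those buyers' rows;
# drop_duplicates() then collapses repeated item rows to one row per order.
filtered_df = df.merge(buyers_with_one_order).drop_duplicates()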

print(filtered_df.to_string(index=False))

# | Order_ID | Order_Date | Buyer_ID |
# |----------|------------|----------|
# | 567868   | 01/05/19   | a346556  |
# | 234534   | 01/10/19   | a678909  |
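
A small variation on the merge approach, sketched under the assumption that you want to keep the original row index and the per-item rows as in the question's desired output: Series.isin can stand in for the merge. The names single_order_buyers and kept are illustrative and not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question (construction is illustrative).
df = pd.DataFrame({
    "Order_ID": [123421, 123421, 123421, 346345, 567868, 567868, 234534],
    "Order_Date": ["01/01/19", "01/01/19", "01/01/19", "01/03/19",
                   "01/05/19", "01/05/19", "01/10/19"],
    "Buyer_ID": ["a213422", "a213422", "a213422", "a213422",
                 "a346556", "a346556", "a678909"],
})

# Distinct orders per buyer, indexed by Buyer_ID.
num_orders = df.groupby("Buyer_ID")["Order_ID"].nunique()

# Index of buyers with exactly one distinct order.
single_order_buyers = num_orders[num_orders == 1].index

# Boolean mask instead of a merge: the original index and all item rows survive.
kept = df[df["Buyer_ID"].isin(single_order_buyers)]

print(kept)
```

Unlike the merge, this does not reset the index, and it leaves the repeated item rows in place unless you chain drop_duplicates() afterwards.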
