Pandas DataFrame: Removing duplicate rows based on condition in columns

Question

I have a large dataframe:

import pandas as pd
df = pd.read_csv('data.csv')

df.head()
ID  Year    status
223725  1991    No
223725  1992    No
223725  1993    No
223725  1994    No
223725  1995    No

I have many unique IDs and I want to remove duplicate rows based on the columns ID and status.

If an ID has a value of Yes in status, then only that row is retained; all other rows for that ID, which have a status of No, are removed.
If an ID has No in every observation in status, then retain any one row for that ID.

For example, in the DataFrame below, only the row where 68084329 has a value of Yes in status (the last row) should be kept; all other rows with No are dropped.

 ID         Year    status
68084329    1991    No
68084329    1992    No
68084329    1993    No
68084329    1994    No
68084329    1995    No
68084329    1996    No
68084329    1997    No
68084329    1998    No
68084329    1999    No
68084329    2000    No
68084329    2001    No
68084329    2002    No
68084329    2003    No
68084329    2004    No
68084329    2005    No
68084329    2006    No
68084329    2007    No
68084329    2008    No
68084329    2010    No
68084329    2011    No
68084329    2012    Yes

How do I drop duplicate rows according to the above conditions?

Please get used to providing a sample df as a callable line of code. You could create a dummy df or get it from your original data with df.head(10).to_dict('list').
– RichieV, Commented Sep 3, 2020 at 16:38
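
A minimal sketch of what this comment suggests, using a few made-up rows in the question's layout (the values are illustrative, not the asker's data). `to_dict('list')` produces a plain dict that can be pasted into a post and turned back into a DataFrame by anyone who wants to reproduce the problem:

```python
import pandas as pd

# Hypothetical stand-in for the asker's data
df = pd.DataFrame({
    "ID": [223725, 223725, 68084329],
    "Year": [1991, 1992, 2012],
    "status": ["No", "No", "Yes"],
})

# Dump the first rows as a dict of column -> list of values
sample = df.head(10).to_dict("list")
print(sample)

# Pasting that dict back into pd.DataFrame rebuilds the sample
rebuilt = pd.DataFrame(sample)
```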

YOLO · Accepted Answer · 2020-09-03 16:34:09Z

I think you can do:

# sort by status so that, alphabetically, No comes before Yes
df = df.sort_values('status')

# pick the last row per ID: the Yes row if the ID has one, otherwise a No row
df = df.groupby('ID').last()
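
As a runnable sketch of the two steps above, here is the same idea on a small made-up sample in the question's layout. Two details are additions for illustration and not part of the answer: `kind="stable"` keeps the original row order among equal statuses, and `as_index=False` keeps ID as a regular column (the plain `groupby('ID').last()` moves ID into the index).

```python
import pandas as pd

# Hypothetical sample data, modeled on the question's layout
df = pd.DataFrame({
    "ID": [223725, 223725, 68084329, 68084329, 68084329],
    "Year": [1991, 1992, 2010, 2012, 2011],
    "status": ["No", "No", "No", "Yes", "No"],
})

# "No" sorts before "Yes", so any "Yes" row ends up last within its ID
deduped = (
    df.sort_values("status", kind="stable")
      .groupby("ID", as_index=False)
      .last()
)
print(deduped)
```

Keep in mind that `last()` takes the last non-null value per column, so this assumes the kept rows have no missing values.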

2 Comments

RichieV Over a year ago

Beat me to it! Just a warning for @MIMA: if there are two rows with Yes for the same ID, this will only keep one of them.

MI MA Over a year ago

Thank you both for the input. Luckily there are no rows with Yes for the same ID.
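
For data where an ID can have more than one Yes row, the case RichieV warns about, one possible variant is a boolean mask instead of sort-and-last. This is only a sketch on made-up data, and it assumes that every Yes row should be kept and that the first row is an acceptable pick for an all-No ID:

```python
import pandas as pd

# Hypothetical data: ID 1 has two "Yes" rows, ID 2 has only "No" rows
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "Year": [2000, 2001, 2002, 2000, 2001],
    "status": ["No", "Yes", "Yes", "No", "No"],
})

is_yes = df["status"].eq("Yes")

# True for every row whose ID has at least one "Yes"
id_has_yes = is_yes.groupby(df["ID"]).transform("any")

# For IDs with no "Yes" at all, keep only their first row
first_of_all_no = ~id_has_yes & ~df.duplicated("ID")

# Keep every "Yes" row plus one row per all-"No" ID
result = df[is_yes | first_of_all_no]
print(result)
```

Unlike the groupby approach, this leaves the surviving rows in their original order and with their original index labels.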
