I have a large dataframe:
import pandas as pd
df = pd.read_csv('data.csv')
df.head()
ID Year status
223725 1991 No
223725 1992 No
223725 1993 No
223725 1994 No
223725 1995 No
I have many unique IDs and I want to remove duplicate rows based on the columns ID and status.
If an ID has a value of Yes in status then only that row is retained; all other rows with a status value of No are removed for that specific ID.
If an ID has No in every observation in status then retain any row specific to that ID.
For example, in the DataFrame below, only the row where 68084329 has a value of Yes in status (the last row) should be kept; all other rows with No are dropped.
ID Year status
68084329 1991 No
68084329 1992 No
68084329 1993 No
68084329 1994 No
68084329 1995 No
68084329 1996 No
68084329 1997 No
68084329 1998 No
68084329 1999 No
68084329 2000 No
68084329 2001 No
68084329 2002 No
68084329 2003 No
68084329 2004 No
68084329 2005 No
68084329 2006 No
68084329 2007 No
68084329 2008 No
68084329 2010 No
68084329 2011 No
68084329 2012 Yes
How do I drop duplicate rows according to the above conditions?
df.head(10).to_dict('list')
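The rule described above could be sketched roughly as follows. This is only one possible approach, not a confirmed solution: it assumes status contains exactly the strings Yes and No, that the index is unique (as with the default index from read_csv), and that when an ID has several Yes rows the first one is the one to keep. The function name drop_status_duplicates and the small sample frame are hypothetical; the sample rows are a subset of the rows shown in the question.

```python
import pandas as pd

def drop_status_duplicates(df):
    """Keep one row per ID, preferring a row whose status is 'Yes'.

    Assumptions (not stated in the question): status is exactly 'Yes'/'No',
    the index is unique, and the first matching row per ID is acceptable.
    """
    is_yes = df["status"].eq("Yes")
    # Put every 'Yes' row ahead of every 'No' row, so that keep="first"
    # picks a 'Yes' row whenever an ID has one.
    prioritized = pd.concat([df[is_yes], df[~is_yes]])
    deduped = prioritized.drop_duplicates(subset="ID", keep="first")
    # Restore the original row order.
    return deduped.sort_index()

# Hypothetical sample built from a few rows of the question's data
sample = pd.DataFrame({
    "ID": [223725, 223725, 68084329, 68084329, 68084329],
    "Year": [1991, 1992, 2010, 2011, 2012],
    "status": ["No", "No", "No", "No", "Yes"],
})

result = drop_status_duplicates(sample)
print(result)
```

For an ID whose rows are all No, this keeps the first row in file order; any other row would satisfy the stated condition equally well.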