I have a dataframe with 3 columns. I would like to drop duplicates in column A based on the values in the other columns. I have searched extensively and can't find a solution for this case.

example:

A B C
Family1 nan nan
Family1 nan 1234
Family1 1245 nan
Family1 3456 78787
Family2 nan nan
Family3 nan nan

Basically, I want to drop a row that is a duplicate in column A ONLY IF both of the other columns are nan. Otherwise, the duplicate can stay.

desired output:

A B C
Family1 nan 1234
Family1 1245 nan
Family1 3456 78787
Family2 nan nan
Family3 nan nan

Family2 and Family3 remain in the df because they don't have duplicates, even though both of their other columns are nan.

  • Can you include the code that creates a dataframe of the source table? Commented Jan 21, 2021 at 23:05
  • df = pd.DataFrame({'A':['Family1','Family1','Family1','Family1','Family2','Family3'],'B':[np.nan,np.nan,1245,3456,np.nan,np.nan],'C':[np.nan,1234,np.nan,78787,np.nan,np.nan]}) Commented Jan 21, 2021 at 23:09

2 Answers

The question is not entirely clear. I suspect you want to drop rows that are duplicated in column A when both columns B and C are NaN. If so, please try:

df[~(df.A.duplicated(keep=False) & df.B.isna() & df.C.isna())]
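
For anyone who wants to run this end to end, here is a minimal sketch of the same idea with each mask named separately. It assumes the data from the table in the question; the variable names `is_dup`, `is_empty` and `result` are only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "A": ["Family1", "Family1", "Family1", "Family1", "Family2", "Family3"],
    "B": [np.nan, np.nan, 1245, 3456, np.nan, np.nan],
    "C": [np.nan, 1234, np.nan, 78787, np.nan, np.nan],
})

# True for every row whose value in A occurs more than once.
is_dup = df["A"].duplicated(keep=False)
# True for rows where both B and C are missing.
is_empty = df["B"].isna() & df["C"].isna()

# Keep everything except rows that are both duplicated and empty.
result = df[~(is_dup & is_empty)]
print(result)
```

With `keep=False`, every member of a duplicated group is flagged, not just the later occurrences, so the first Family1 row can be dropped too.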

Try combining two boolean masks: the first is True for every row whose value in A is duplicated, and the second is True for rows where every column after A is null. Rows that meet both conditions are excluded with the ~ operator, which inverts a boolean mask.

df[~(df.duplicated(subset=['A'], keep=False) & df.iloc[:, 1:].isna().all(axis=1))]

          A     B        C
1  Family1    NaN     1234
2  Family1   1245      NaN
3  Family1   3456    78787
4  Family2    NaN      NaN
5  Family3    NaN      NaN
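
If the key column is not always first, the same filter can be written by column name instead of position. The following is only a sketch: `drop_empty_duplicates` is a hypothetical helper name, not a pandas function, and the data is again taken from the table in the question.

```python
import numpy as np
import pandas as pd


def drop_empty_duplicates(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Drop rows whose `key` value is duplicated and whose other columns are all NaN.

    Hypothetical helper for illustration; not part of pandas.
    """
    # True for every row whose key value appears more than once.
    is_dup = df.duplicated(subset=[key], keep=False)
    # True for rows where every column other than the key is missing.
    is_empty = df.drop(columns=[key]).isna().all(axis=1)
    return df[~(is_dup & is_empty)]


# Sample data copied from the table in the question.
df = pd.DataFrame({
    "A": ["Family1", "Family1", "Family1", "Family1", "Family2", "Family3"],
    "B": [np.nan, np.nan, 1245, 3456, np.nan, np.nan],
    "C": [np.nan, 1234, np.nan, 78787, np.nan, np.nan],
})

result = drop_empty_duplicates(df, "A")
print(result)
```

One caveat: if a key appears several times and all of its rows are empty, every one of those rows is dropped and the key disappears from the result. The question does not say what should happen in that case, so check whether that is acceptable for your data.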