4

I have a data frame df where some rows are duplicates with respect to a subset of columns:

A    B     C
1    Blue  Green
2    Red   Green
3    Red   Green
4    Blue  Orange
5    Blue  Orange

I would like to remove (or replace with a dummy string) values for duplicate rows with respect to B and C, without deleting the whole row, ideally producing:

A    B     C
1    Blue  Green
2    Red   Green
3    NaN   NaN
4    Blue  Orange
5    NaN   NaN

Following this thread: Replace duplicate values across columns in Pandas, I've tried using pd.Series.duplicated; however, I can't get it to work with duplicates in a subset of columns.

I've also played around with:

is_duplicate = df.loc[df.duplicated(subset=['B','C'])]
df = df.where(is_duplicate==True, 999)  # 999 intended as a placeholder that I could find-and-replace later on

However this replaces almost every row with 999 in each column - so clearly I'm doing something wrong. I'd appreciate any advice on how to proceed!
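For context on why `where` behaves this way: `DataFrame.where(cond, other)` *keeps* values where the condition is True and replaces them where it is False, so replacing the duplicated rows needs the inverted condition, or `mask`, which is its complement. A minimal sketch of the difference, recreating the sample frame from the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": ["Blue", "Red", "Red", "Blue", "Blue"],
                   "C": ["Green", "Green", "Green", "Orange", "Orange"]})

# True for rows whose (B, C) pair repeats an earlier one (rows 2 and 4):
dup = df.duplicated(subset=["B", "C"])

# where() KEEPS values where the condition is True, so this replaces
# every NON-duplicate cell with 999 -- the opposite of what's wanted:
wrong = df.where(dup, 999)

# mask() replaces values where the condition is True, which is the
# desired behaviour (default replacement is NaN):
right = df[["B", "C"]].mask(dup)
print(right)
```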

3 Answers

6

df.loc[df.duplicated(subset=['B','C']), ['B','C']] = np.nan seems to work for me.
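Run against the sample data from the question, this one-liner produces the requested output (a quick self-contained check):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": ["Blue", "Red", "Red", "Blue", "Blue"],
                   "C": ["Green", "Green", "Green", "Orange", "Orange"]})

# Blank out only B and C on rows that duplicate an earlier (B, C) pair;
# column A is left untouched:
df.loc[df.duplicated(subset=["B", "C"]), ["B", "C"]] = np.nan
print(df)
```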

Edited to include @ALollz and @macaw_9227 correction.


1 Comment

If you do df.loc[df.duplicated(subset=['B','C']), ['B','C']] = np.NaN you keep the values of 'A' and only remove 'B' and 'C'.
2

Let me share with you how I used to confront those kinds of challenges when starting out. Obviously there are quicker ways (a one-liner), but for the sake of the answer let's do it at a more intuitive level (later, you'll see that you can do it in one line).

So here we go...

df = pd.DataFrame({"B":['Blue','Red','Red','Blue','Blue'],"C":['Green','Green','Green','Orange','Orange']})

which results in:

      B       C
0  Blue   Green
1   Red   Green
2   Red   Green
3  Blue  Orange
4  Blue  Orange

Step 1: identify the duplication:

For this, I'm simply adding a helper column that records True/False for whether the (B, C) pair is duplicated.

df['IS_DUPLICATED']= df.duplicated(subset=['B','C'])

      B       C  IS_DUPLICATED
0  Blue   Green          False
1   Red   Green          False
2   Red   Green           True
3  Blue  Orange          False
4  Blue  Orange           True

Step 2: identify the indexes of the rows where IS_DUPLICATED is True:

dup_index = df[df['IS_DUPLICATED']==True].index

result: Int64Index([2, 4], dtype='int64')

Step 3: mark them as NaN (note this blanks every column of the matched rows, including IS_DUPLICATED):

df.iloc[dup_index] = np.NaN

      B       C IS_DUPLICATED
0  Blue   Green         False
1   Red   Green         False
2   NaN     NaN           NaN
3  Blue  Orange         False
4   NaN     NaN           NaN

Step 4: remove the IS_DUPLICATED column:

df.drop('IS_DUPLICATED',axis=1, inplace=True)

and the desired result:

      B       C
0  Blue   Green
1   Red   Green
2   NaN     NaN
3  Blue  Orange
4   NaN     NaN
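For reference, the four steps above can be put together as one runnable script. Note that I've narrowed the assignment in step 3 to just B and C, which avoids the side effect (raised in the comments below) of blanking every column when the frame has more columns than this sample; I've also added an A column to demonstrate that:

```python
import pandas as pd
import numpy as np

# Sample frame from the answer, plus an extra column A to show that
# only B and C get blanked:
df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": ["Blue", "Red", "Red", "Blue", "Blue"],
                   "C": ["Green", "Green", "Green", "Orange", "Orange"]})

df["IS_DUPLICATED"] = df.duplicated(subset=["B", "C"])  # step 1: flag duplicates
dup_index = df[df["IS_DUPLICATED"]].index               # step 2: collect their indexes
df.loc[dup_index, ["B", "C"]] = np.nan                  # step 3: blank B and C only
df.drop("IS_DUPLICATED", axis=1, inplace=True)          # step 4: drop the helper column
print(df)
```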

2 Comments

Thanks for taking the time to address this in detail! I tried implementing this in a more complex dataframe (i.e. with more columns than the one described above) and it seems to preserve the row, but deletes values in all columns (i.e. not just the specified subset).
@Lyam, with pleasure :-) if this analysis works on this sample dataframe, it should work on the bigger dataframe. I suspect that something is inconsistent with the indexes - check on your end if you got it right. Again, if it works on this set it should work on the entire set. Hope this helps.
1

I would use

df[['B','C']]=df[['B','C']].mask(df.duplicated(['B','C']))
df
Out[141]: 
   A     B       C
0  1  Blue   Green
1  2   Red   Green
2  3   NaN     NaN
3  4  Blue  Orange
4  5   NaN     NaN

1 Comment

If I would like to change the duplicates to another value (say: 'DUPLI') instead of 'NaN', how would you do this?
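In reply to the comment: `mask` accepts an optional second argument (`other`) giving the replacement value, with NaN as the default. A sketch using the 'DUPLI' string from the comment:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": ["Blue", "Red", "Red", "Blue", "Blue"],
                   "C": ["Green", "Green", "Green", "Orange", "Orange"]})

# mask's second argument is the replacement value used where the
# condition is True ('DUPLI' instead of the default NaN):
df[["B", "C"]] = df[["B", "C"]].mask(df.duplicated(["B", "C"]), "DUPLI")
print(df)
```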
