4

I have a data frame df where some rows are duplicates with respect to a subset of columns:

A    B     C
1    Blue  Green
2    Red   Green
3    Red   Green
4    Blue  Orange
5    Blue  Orange

I would like to remove (or replace with a dummy string) values for duplicate rows with respect to B and C, without deleting the whole row, ideally producing:

A    B     C
1    Blue  Green
2    Red   Green
3    NaN   NaN
4    Blue  Orange
5    NaN   NaN

Following this thread: Replace duplicate values across columns in Pandas, I've tried using pd.Series.duplicated; however, I can't get it to work with duplicates in a subset of columns.

I've also played around with:

is_duplicate = df.loc[df.duplicated(subset=['B','C'])]
df = df.where(is_duplicate==True, 999)  # 999 intended as a placeholder that I could find-and-replace later on

However this replaces almost every row with 999 in each column - so clearly I'm doing something wrong. I'd appreciate any advice on how to proceed!
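For context on why `where` behaves this way: `DataFrame.where(cond, other)` *keeps* values where the condition is True and replaces them where it is False, so replacing the duplicated rows needs the inverted condition, or `mask`, which is its complement. A minimal sketch of the difference, recreating the sample frame from the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": ["Blue", "Red", "Red", "Blue", "Blue"],
                   "C": ["Green", "Green", "Green", "Orange", "Orange"]})

# True for rows whose (B, C) pair repeats an earlier one (rows 2 and 4):
dup = df.duplicated(subset=["B", "C"])

# where() KEEPS values where the condition is True, so this replaces
# every NON-duplicate cell with 999 -- the opposite of what's wanted:
wrong = df.where(dup, 999)

# mask() replaces values where the condition is True, which is the
# desired behaviour (default replacement is NaN):
right = df[["B", "C"]].mask(dup)
print(right)
```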

3 Answers

6

df.loc[df.duplicated(subset=['B','C']), ['B','C']] = np.nan seems to work for me.
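Run against the sample data from the question, this one-liner produces the requested output (a quick self-contained check):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": ["Blue", "Red", "Red", "Blue", "Blue"],
                   "C": ["Green", "Green", "Green", "Orange", "Orange"]})

# Blank out only B and C on rows that duplicate an earlier (B, C) pair;
# column A is left untouched:
df.loc[df.duplicated(subset=["B", "C"]), ["B", "C"]] = np.nan
print(df)
```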

Edited to include @ALollz and @macaw_9227 correction.


1 Comment

If you do df.loc[df.duplicated(subset=['B','C']), ['B','C']] = np.NaN you keep the values of 'A' and only remove 'B' and 'C'.
2

Let me share with you how I used to confront those kinds of challenges when starting out. Obviously there are quicker ways (a one-liner), but for the sake of the answer let's do it at a more intuitive level (later, you'll see that you can do it in one line).

So here we go...

df = pd.DataFrame({"B":['Blue','Red','Red','Blue','Blue'],"C":['Green','Green','Green','Orange','Orange']})

which results in:

      B       C
0  Blue   Green
1   Red   Green
2   Red   Green
3  Blue  Orange
4  Blue  Orange

Step 1: identify the duplication:

For this, I'm simply adding a helper column that records True/False for whether the (B, C) pair is duplicated.

df['IS_DUPLICATED']= df.duplicated(subset=['B','C'])

      B       C  IS_DUPLICATED
0  Blue   Green          False
1   Red   Green          False
2   Red   Green           True
3  Blue  Orange          False
4  Blue  Orange           True

Step 2: identify the indexes of the rows where IS_DUPLICATED is True:

dup_index = df[df['IS_DUPLICATED']==True].index

result: Int64Index([2, 4], dtype='int64')

Step 3: mark them as NaN (note this blanks every column of the matched rows, including IS_DUPLICATED):

df.iloc[dup_index] = np.NaN

      B       C IS_DUPLICATED
0  Blue   Green         False
1   Red   Green         False
2   NaN     NaN           NaN
3  Blue  Orange         False
4   NaN     NaN           NaN

Step 4: remove the IS_DUPLICATED column:

df.drop('IS_DUPLICATED',axis=1, inplace=True)

and the desired result:

      B       C
0  Blue   Green
1   Red   Green
2   NaN     NaN
3  Blue  Orange
4   NaN     NaN
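For reference, the four steps above can be put together as one runnable script. Note that I've narrowed the assignment in step 3 to just B and C, which avoids the side effect (raised in the comments below) of blanking every column when the frame has more columns than this sample; I've also added an A column to demonstrate that:

```python
import pandas as pd
import numpy as np

# Sample frame from the answer, plus an extra column A to show that
# only B and C get blanked:
df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": ["Blue", "Red", "Red", "Blue", "Blue"],
                   "C": ["Green", "Green", "Green", "Orange", "Orange"]})

df["IS_DUPLICATED"] = df.duplicated(subset=["B", "C"])  # step 1: flag duplicates
dup_index = df[df["IS_DUPLICATED"]].index               # step 2: collect their indexes
df.loc[dup_index, ["B", "C"]] = np.nan                  # step 3: blank B and C only
df.drop("IS_DUPLICATED", axis=1, inplace=True)          # step 4: drop the helper column
print(df)
```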

2 Comments

Thanks for taking the time to address this in detail! I tried implementing this in a more complex dataframe (i.e. with more columns than the one described above) and it seems to preserve the row, but deletes values in all columns (i.e. not just the specified subset).
@Lyam, with pleasure :-) if this analysis works on this sample dataframe, it should work on the bigger dataframe. I suspect that something is inconsistent with the indexes - check on your end if you got it right. Again, if it works on this set it should work on the entire set. Hope this helps.
1

I would use

df[['B','C']]=df[['B','C']].mask(df.duplicated(['B','C']))
df
Out[141]: 
   A     B       C
0  1  Blue   Green
1  2   Red   Green
2  3   NaN     NaN
3  4  Blue  Orange
4  5   NaN     NaN

1 Comment

If I would like to change the duplicates to another value (say: 'DUPLI') instead of 'NaN', how would you do this?
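In reply to the comment: `mask` accepts an optional second argument (`other`) giving the replacement value, with NaN as the default. A sketch using the 'DUPLI' string from the comment:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": ["Blue", "Red", "Red", "Blue", "Blue"],
                   "C": ["Green", "Green", "Green", "Orange", "Orange"]})

# mask's second argument is the replacement value used where the
# condition is True ('DUPLI' instead of the default NaN):
df[["B", "C"]] = df[["B", "C"]].mask(df.duplicated(["B", "C"]), "DUPLI")
print(df)
```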
