Replace the values of all duplicate rows in Python Pandas

Question

I have a data frame df where some rows are duplicates with respect to a subset of columns:

df

A       B        C       D
1       Blue     Green   4
2       Red      Green   6
3       Red      Green   2
4       Blue     Pink    6
5       Blue     Orange  9
6       Blue     Orange  8
7       Blue     Red     8
8       Red      Orange  9

I would like to replace values for duplicate rows with respect to B and C and replace the whole row by 'ERR', ideally producing:

A       B        C       D
1       Blue     Green   4
ERR     ERR      ERR     ERR
ERR     ERR      ERR     ERR
4       Blue     Pink    6
ERR     ERR      ERR     ERR
ERR     ERR      ERR     ERR
7       Blue     Red     8
8       Red      Orange  9

So brief: If there are duplicate rows for columns B and C, all the values in those rows should be set to 'ERR' (not only the duplicate ones).

Solved! -> thanks to @anky_91

df = pd.DataFrame({"A": [1,2,3,4,5,6,7,8], "B": ['Blue', 'Red', 'Red', 'Blue', 'Blue', 'Blue', 'Blue', 'Red'], "C": ['Green', 'Green', 'Green', 'Pink', 'Orange', 'Orange', 'Red', 'Orange'], "D": [4,6,2,6,9,8,8,9]})

df = df.mask(df.duplicated(['B','C'], keep=False), 'ERR')
print(df)

What would the 'ERR' mean? Could NaN or None be more appropriate? — AMC
– AMC, Commented Feb 16, 2020 at 4:55
Well I want to keep track of the 'errors' in the dataframe (if two rows have the same value for every column, there is no problem, I just erase one of but but if, two rows in B and C have the same paired value but the values in the others column are different then this is seen as an error in the dataframe which I would want to keep track of). I'm working with databases, I'm using np.NaN already for the gaps in the database, there needs to be a difference between 'error' en 'gap' my database ;-) — Matthi9000
– Matthi9000, Commented Feb 16, 2020 at 8:06

anky · Accepted Answer · 2020-02-15 16:19:27Z

1

You can use df.mask here with df.duplicated

df.mask(df.duplicated(['B','C'],keep=False),'ERR')

     A     B       C    D
0    1  Blue   Green    4
1  ERR   ERR     ERR  ERR
2  ERR   ERR     ERR  ERR
3    4  Blue    Pink    6
4  ERR   ERR     ERR  ERR
5  ERR   ERR     ERR  ERR
6    7  Blue     Red    8
7    8   Red  Orange    9

Collectives™ on Stack Overflow

Replace the values of all duplicate rows in Python Pandas

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related