I have a data frame df where some rows are duplicates with respect to a subset of columns:

df

A       B        C       D
1       Blue     Green   4
2       Red      Green   6
3       Red      Green   2
4       Blue     Pink    6
5       Blue     Orange  9
6       Blue     Orange  8
7       Blue     Red     8
8       Red      Orange  9

I would like to find the rows that are duplicated with respect to B and C and replace every value in those rows with 'ERR', ideally producing:

A       B        C       D
1       Blue     Green   4
ERR     ERR      ERR     ERR
ERR     ERR      ERR     ERR
4       Blue     Pink    6
ERR     ERR      ERR     ERR
ERR     ERR      ERR     ERR
7       Blue     Red     8
8       Red      Orange  9

In brief: if several rows share the same values in columns B and C, all the values in all of those rows should be set to 'ERR' (not only in the later duplicates).

Solved! -> thanks to @anky_91

import pandas as pd

df = pd.DataFrame({"A": [1,2,3,4,5,6,7,8], "B": ['Blue', 'Red', 'Red', 'Blue', 'Blue', 'Blue', 'Blue', 'Red'], "C": ['Green', 'Green', 'Green', 'Pink', 'Orange', 'Orange', 'Red', 'Orange'], "D": [4,6,2,6,9,8,8,9]})

df = df.mask(df.duplicated(['B','C'], keep=False), 'ERR')
print(df)

  • df.mask(df.duplicated(['B','C'],keep=False),'ERR') ? Commented Feb 15, 2020 at 16:10
  • That's it! I'll add it to the post. Commented Feb 15, 2020 at 16:17
  • What would the 'ERR' mean? Could NaN or None be more appropriate? Commented Feb 16, 2020 at 4:55
  • Well, I want to keep track of the 'errors' in the dataframe. If two rows have the same value in every column, there is no problem: I just drop one of them. But if two rows have the same paired values in B and C while the values in the other columns differ, this counts as an error in the dataframe, which I want to keep track of. I'm working with databases and already use np.NaN for the gaps, so there needs to be a difference between 'error' and 'gap' in my database ;-) Commented Feb 16, 2020 at 8:06
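
The workflow described in the last comment (drop exact duplicates, then flag rows that agree on B and C but differ elsewhere, while leaving NaN gaps alone) could be sketched roughly as follows. The sample data here is hypothetical and only meant to illustrate the idea; it is not taken from the asker's database.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical in every column,
# rows 2 and 3 share B and C but differ in D, row 4 has a gap (NaN).
df = pd.DataFrame({
    "B": ["Blue", "Blue", "Red", "Red", "Pink"],
    "C": ["Green", "Green", "Orange", "Orange", "Red"],
    "D": [4, 4, 6, 2, np.nan],
})

# Step 1: rows identical in every column are harmless, so keep one copy.
deduped = df.drop_duplicates()

# Step 2: what is still duplicated on (B, C) must differ in another column.
conflict = deduped.duplicated(subset=["B", "C"], keep=False)

# Step 3: overwrite the conflicting rows; NaN gaps are left untouched.
result = deduped.mask(conflict, "ERR")
print(result)
```

With this ordering, 'ERR' marks only genuine conflicts, and NaN keeps its meaning as a gap.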

1 Answer

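You can use df.mask together with df.duplicated. Passing keep=False marks every row of a duplicate group, not only the later occurrences, so all of them are replaced: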

df.mask(df.duplicated(['B','C'],keep=False),'ERR')

     A     B       C    D
0    1  Blue   Green    4
1  ERR   ERR     ERR  ERR
2  ERR   ERR     ERR  ERR
3    4  Blue    Pink    6
4  ERR   ERR     ERR  ERR
5  ERR   ERR     ERR  ERR
6    7  Blue     Red    8
7    8   Red  Orange    9
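
To see why this works, it may help to split the one-liner into its two steps. The following is a minimal sketch using the question's data; the variable names is_dup and flagged are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": ["Blue", "Red", "Red", "Blue", "Blue", "Blue", "Blue", "Red"],
    "C": ["Green", "Green", "Green", "Pink", "Orange", "Orange", "Red", "Orange"],
    "D": [4, 6, 2, 6, 9, 8, 8, 9],
})

# keep=False flags every member of a duplicate group on (B, C),
# whereas the default keep='first' would leave the first occurrence unflagged.
is_dup = df.duplicated(subset=["B", "C"], keep=False)
print(is_dup.tolist())
# [False, True, True, False, True, True, False, False]

# mask replaces every cell of the rows where the condition is True.
flagged = df.mask(is_dup, "ERR")
print(flagged)
```

Note that mask returns a new frame rather than modifying df, and that the numeric columns A and D hold a mix of integers and strings afterwards, so they should no longer be expected to behave as numeric columns.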