
I would like to drop all values which are duplicates across a subset of two or more columns, without removing the entire row.

Dataframe:

    A   B   C
0   foo g   A
1   foo g   G
2   yes y   B
3   bar y   B

Desired result:

    A   B   C
0   foo g   A
1   NaN NaN G
2   yes y   B
3   bar NaN NaN

I have tried the drop_duplicates() method by splitting the data into new data frames by column and then re-appending them together, but this had its own issues.

I have also tried this solution and this one, but I am still stuck. Any guidance would be much appreciated.

(updated original question)

3 Answers


Go through the code below; you will clearly see the difference between mask and where.

import pandas as pd
import numpy as np


df = pd.DataFrame({'A': ['foo', 'foo', 'yes', 'bar'],
                   'B': ['g', 'g', 'y', 'y'],
                   'C': ['A', 'G', 'B', 'B']})
print(df)
"""
     A  B  C
0  foo  g  A
1  foo  g  G
2  yes  y  B
3  bar  y  B

"""

aa = df.apply(pd.Series.duplicated)
print(aa)
"""
       A      B      C
0  False  False  False
1   True   True  False
2  False  False  False
3  False   True   True
"""
using_where = df.where(~aa)
print(using_where)
"""
    A    B    C
0  foo    g    A
1  NaN  NaN    G
2  yes    y    B
3  bar  NaN  NaN

"""
using_mask = df.mask(aa)
print(using_mask)

"""
     A    B    C
0  foo    g    A
1  NaN  NaN    G
2  yes    y    B
3  bar  NaN  NaN
"""

2 Comments

It's a great answer. Can I ask what the ~ operator does in the where condition?
The ~ operator in Python performs a bitwise NOT operation. When applied to a boolean Series like aa above, it inverts each True/False value elementwise.
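A quick illustration of that inversion (the Series here is just an example, not from the answer's data):

```python
import pandas as pd

m = pd.Series([True, False, True])
# ~ flips each boolean elementwise
print(~m)
"""
0    False
1     True
2    False
dtype: bool
"""
```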

Without removing the entire rows, you can replace the duplicated values with NaN:

import numpy as np

# df: your dataframe
for c_name in df.columns:
    duplicated = df.duplicated(c_name)
    df.loc[duplicated, c_name] = np.nan

print(df)

I referred to this.

3 Comments

Thanks for your answer @HSL. In the above example I wanted these removed because the duplicates occur across a subset of more than one column (2), not one column alone. Do you know how to change your code to only replace duplicates with NaN if they occur over two or more columns together? (I have updated my original question)
@btroppo I'm not sure I understand. The table you describe is 4 rows and 3 columns. You mean check duplicates over two or more columns together, not one at a time, because this code checks just one column?
Sorry, yes that's right: your code removes duplicates occurring in each column, but if possible I need duplicates removed only if they occur over two or more columns together. For example, if I used your code on the dataframe from my example but column A row 0 held a different string, say "eg", then the output would replace column B row 1 "g" with NaN, which I don't want.
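If (as this comment thread suggests) the goal is to blank values only when the combination of columns repeats, one possible sketch is to test duplication on the pair with duplicated(subset=...) and mask only those columns. The choice of ['A', 'B'] as the subset is an assumption based on the example:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'foo', 'yes', 'bar'],
                   'B': ['g', 'g', 'y', 'y'],
                   'C': ['A', 'G', 'B', 'B']})

# True for rows that repeat an earlier (A, B) combination
pair_dup = df.duplicated(subset=['A', 'B'])

# NaN out only columns A and B on those rows; C is untouched,
# and single-column repeats (e.g. 'y' in B alone) are kept
df.loc[pair_dup, ['A', 'B']] = np.nan
print(df)
"""
     A    B  C
0  foo    g  A
1  NaN  NaN  G
2  yes    y  B
3  bar    y  B
"""
```

Note row 3 keeps its values here, because (bar, y) does not repeat an earlier pair, which is the behavior asked for in the follow-up.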

try this:

result = df.mask(df.apply(pd.Series.duplicated))
print(result)
>>>
     A    B    C
0  foo    g    A
1  NaN  NaN    G
2  yes    y    B
3  bar  NaN  NaN

1 Comment

Many thanks. I'm sorry, I should have been a bit clearer and have updated the original post: I want to keep data in certain columns when they occur as duplicates in only one column. For example, in the above example I wanted these removed because the duplicates occur across a subset of more than one column (2), not one column alone. Does that make sense?
