
I would like to drop all values which are duplicates across a subset of two or more columns, without removing the entire row.

Dataframe:

    A   B   C
0   foo g   A
1   foo g   G
2   yes y   B
3   bar y   B

Desired result:

    A   B   C
0   foo g   A
1   NaN NaN G
2   yes y   B
3   bar NaN NaN

I have tried the drop_duplicates() method by splitting the data into new data frames by column and then re-appending them together, but this had its own issues.

I have also tried this solution and this one, but I am still stuck. Any guidance would be much appreciated.

(updated original question)

3 Answers


Go through the code below; you will clearly see the difference between mask and where.

import pandas as pd
import numpy as np


df = pd.DataFrame({'A': ['foo', 'foo', 'yes', 'bar'],
                   'B': ['g', 'g', 'y', 'y'],
                   'C': ['A', 'G', 'B', 'B']})
print(df)
"""
     A  B  C
0  foo  g  A
1  foo  g  G
2  yes  y  B
3  bar  y  B

"""

aa = df.apply(pd.Series.duplicated)
print(aa)
"""
       A      B      C
0  False  False  False
1   True   True  False
2  False  False  False
3  False   True   True
"""
using_where = df.where(~aa)
print(using_where)
"""
    A    B    C
0  foo    g    A
1  NaN  NaN    G
2  yes    y    B
3  bar  NaN  NaN

"""
using_mask = df.mask(aa)
print(using_mask)

"""
     A    B    C
0  foo    g    A
1  NaN  NaN    G
2  yes    y    B
3  bar  NaN  NaN
"""

2 Comments

It's a great answer. Can I ask what the ~ operator does in the where condition?
The ~ operator in Python performs a bitwise NOT operation. When applied to a boolean Series like aa above, it inverts each True/False value elementwise.
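A quick illustration of that inversion (the Series here is just an example, not from the answer's data):

```python
import pandas as pd

m = pd.Series([True, False, True])
# ~ flips each boolean elementwise
print(~m)
"""
0    False
1     True
2    False
dtype: bool
"""
```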

Without removing the entire rows, you can replace the duplicated values with NaN:

import numpy as np

# df: your dataframe
for c_name in df.columns:
    duplicated = df.duplicated(c_name)
    df.loc[duplicated, c_name] = np.nan

print(df)

I referred to this.

3 Comments

Thanks for your answer @HSL. In the above example I wanted these removed because the duplicates occur across a subset of more than one column (2), not one column alone. Do you know how to change your code to only replace duplicates with NaN if they occur over two or more columns together? (I have updated my original question)
@btroppo I'm not sure I understand. The table you describe is 4 rows and 3 columns. You mean check duplicates over two or more columns together, not one at a time, because this code checks just one column?
Sorry, yes that's right: your code removes duplicates occurring in each column, but if possible I need duplicates removed only if they occur over two or more columns together. For example, if I used your code on the dataframe from my example but column A row 0 held a different string, say "eg", then the output would replace column B row 1 "g" with NaN, which I don't want.
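If (as this comment thread suggests) the goal is to blank values only when the combination of columns repeats, one possible sketch is to test duplication on the pair with duplicated(subset=...) and mask only those columns. The choice of ['A', 'B'] as the subset is an assumption based on the example:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'foo', 'yes', 'bar'],
                   'B': ['g', 'g', 'y', 'y'],
                   'C': ['A', 'G', 'B', 'B']})

# True for rows that repeat an earlier (A, B) combination
pair_dup = df.duplicated(subset=['A', 'B'])

# NaN out only columns A and B on those rows; C is untouched,
# and single-column repeats (e.g. 'y' in B alone) are kept
df.loc[pair_dup, ['A', 'B']] = np.nan
print(df)
"""
     A    B  C
0  foo    g  A
1  NaN  NaN  G
2  yes    y  B
3  bar    y  B
"""
```

Note row 3 keeps its values here, because (bar, y) does not repeat an earlier pair, which is the behavior asked for in the follow-up.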

try this:

result = df.mask(df.apply(pd.Series.duplicated))
print(result)
>>>
     A    B    C
0  foo    g    A
1  NaN  NaN    G
2  yes    y    B
3  bar  NaN  NaN

1 Comment

Many thanks. I'm sorry, I should have been a bit clearer and have updated the original post: I want to keep data in certain columns when they occur as duplicates in only one column. For example, in the above example I wanted these removed because the duplicates occur across a subset of more than one column (2), not one column alone. Does that make sense?
