python : pandas: remove duplicates from 2 columns

Question

I have a dataframe like the one below:

     A      B
 chair    bed
   bed  chair
 spoon  knife
 plate    cup

Rows 1 and 2 hold the same pair of values in swapped order, so I consider them duplicates and want both removed. How can I do this in a simple way?

So after removing duplicates I will have:

     A      B
 spoon  knife
 plate    cup

Thank you.

jezrael · Accepted Answer · 2018-02-27 16:36:06Z

Use boolean indexing with the mask inverted by ~:

df = df[~pd.DataFrame(np.sort(df[['A','B']], axis=1)).duplicated(keep=False)]

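Another, slower solution (the sorted values are wrapped in a tuple so that they are hashable on recent pandas versions, where apply returns a Series of lists):

df = df[~df[['A','B']].apply(lambda row: tuple(sorted(row)), axis=1).duplicated(keep=False)]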

print (df)
       A      B
2  spoon  knife
3  plate    cup

Detail:

print (pd.DataFrame(np.sort(df[['A','B']], axis=1)))
       0      1
0    bed  chair
1    bed  chair
2  knife  spoon
3    cup  plate

print (pd.DataFrame(np.sort(df[['A','B']], axis=1)).duplicated(keep=False))
0     True
1     True
2    False
3    False
dtype: bool
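
For reference, the steps above can be put together as a self-contained sketch. It rebuilds the sample frame from the question and assumes string values in both columns. Passing index=df.index to the helper frame is an addition not in the answer above; it keeps the mask aligned when df does not have a default 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": ["chair", "bed", "spoon", "plate"],
    "B": ["bed", "chair", "knife", "cup"],
})

# Sort the values within each row so that (chair, bed) and (bed, chair)
# become the same ordered pair.
pairs = pd.DataFrame(
    np.sort(df[["A", "B"]].to_numpy(), axis=1),
    index=df.index,  # assumption: keep the original index for alignment
)

# keep=False marks every member of a duplicated group, not only the repeats.
mask = pairs.duplicated(keep=False)

# Invert the mask to keep only rows whose pair occurs exactly once.
result = df[~mask]
print(result)
```

Using keep='first' instead of keep=False would retain one row from each duplicated pair rather than dropping them all.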

Timings:

df = pd.concat([df] * 10000, ignore_index=True)

In [441]: %%timeit
     ...: df[~pd.DataFrame(np.sort(df[['A','B']], axis=1)).duplicated(keep=False)]
     ...: 
100 loops, best of 3: 9.38 ms per loop

In [442]: %%timeit
     ...: df[~df[['A','B']].apply(sorted, axis=1).duplicated(keep=False)]
     ...: 
1 loop, best of 3: 4.46 s per loop

#jpp solution
In [443]: %%timeit
     ...: df['C'] = list(map(frozenset, df[['A', 'B']].values.tolist()))
     ...: df.drop_duplicates('C', keep=False).drop('C', 1)
     ...: 
10 loops, best of 3: 28.4 ms per loop
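
The comparison can also be rerun outside IPython. The following is only a rough sketch using the standard-library timeit module; the function names with_np_sort and with_frozenset are made up for this example, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame({
    "A": ["chair", "bed", "spoon", "plate"],
    "B": ["bed", "chair", "knife", "cup"],
})
# Same enlargement as in the timings above: 40,000 rows.
big = pd.concat([small] * 10000, ignore_index=True)


def with_np_sort(frame):
    """Row-wise np.sort, then drop every row whose pair repeats."""
    pairs = pd.DataFrame(
        np.sort(frame[["A", "B"]].to_numpy(), axis=1), index=frame.index
    )
    return frame[~pairs.duplicated(keep=False)]


def with_frozenset(frame):
    """One frozenset per row, then drop every row whose set repeats."""
    keys = pd.Series(
        list(map(frozenset, frame[["A", "B"]].to_numpy().tolist())),
        index=frame.index,
    )
    return frame[~keys.duplicated(keep=False)]


for fn in (with_np_sort, with_frozenset):
    seconds = timeit.timeit(lambda: fn(big), number=10) / 10
    print(f"{fn.__name__}: {seconds * 1000:.2f} ms per call")
```

Note that in the enlarged frame every pair occurs many times, so both functions return an empty result there; the benchmark measures the cost of building the mask, not the size of the output.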

jpp · Accepted Answer · 2018-02-27 10:41:04Z


This is one way using frozenset:

df['C'] = list(map(frozenset, df[['A', 'B']].values.tolist()))
df = df.drop_duplicates('C', keep=False).drop(columns='C')

Result

       A      B
2  spoon  knife
3  plate    cup

Explanation

First create frozenset column 'C' from 'A' and 'B'.
Drop duplicates, setting keep=False, and drop column 'C'.
frozenset is required instead of set since sets are not hashable.
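
As a variation on the explanation above, the same idea can be sketched without adding a helper column to df, by keeping the frozensets in a separate Series and using it as a mask. This is a minimal sketch on the sample data from the question; the names keys and dup_mask are chosen for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": ["chair", "bed", "spoon", "plate"],
    "B": ["bed", "chair", "knife", "cup"],
})

# One frozenset per row: order within the row no longer matters,
# and frozensets are hashable, so duplicated() can compare them.
keys = pd.Series(
    list(map(frozenset, df[["A", "B"]].to_numpy().tolist())),
    index=df.index,
)

# keep=False flags every row whose set of values occurs more than once.
dup_mask = keys.duplicated(keep=False)

result_fs = df[~dup_mask]
print(result_fs)
```

One caveat: a frozenset also discards repeated values within a row. With two columns this is harmless, but with three or more columns rows such as (a, a, b) and (a, b, b) would be treated as equal.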

edited Feb 27, 2018 at 10:41

answered Feb 27, 2018 at 10:28
