Remove reverse duplicates from dataframe

Question

I have a data frame with two columns, A and B. The order of A and B is unimportant in this context; for example, I would consider (0,50) and (50,0) to be duplicates. In pandas, what is an efficient way to remove these duplicates from a dataframe?

import pandas as pd

# Initial data frame.
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50], 
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
data
    A   B
0   0  50
1  10  22
2  11  35
3  21   5
4  22  10
5  35  11
6   5  21
7  50   0

# Desired output with "duplicates" removed. 
data2 = pd.DataFrame({'A': [0, 5, 10, 11], 
                      'B': [50, 21, 22, 35]})
data2
    A   B
0   0  50
1   5  21
2  10  22
3  11  35

Ideally, the output would be sorted by values of column A.

akuiper · Accepted Answer · 2016-11-07 21:22:43Z

15

You can sort each row of the data frame before dropping the duplicates:

data.apply(lambda r: sorted(r), axis = 1).drop_duplicates()

#   A    B
#0  0   50
#1  10  22
#2  11  35
#3  5   21

If you prefer the result to be sorted by column A:

data.apply(lambda r: sorted(r), axis = 1).drop_duplicates().sort_values('A')

#   A    B
#0  0   50
#3  5   21
#1  10  22
#2  11  35
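
Note for newer pandas versions: a row-wise `apply` whose function returns a list may produce a Series of lists instead of a DataFrame, in which case `drop_duplicates()` cannot hash the rows. A minimal sketch of the same idea, assuming a pandas version that supports the `result_type` argument of `DataFrame.apply`:

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# result_type='broadcast' keeps the original shape, index and column labels,
# so each sorted row is written back into columns A and B.
sorted_rows = data.apply(sorted, axis=1, result_type='broadcast')

# Drop the now-identical rows, then order by column A as the question asks.
result = sorted_rows.drop_duplicates().sort_values('A')
print(result)
```

The original index labels are kept, so a `.reset_index(drop=True)` can be appended if a clean 0..n-1 index is wanted.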

answered Nov 7, 2016 at 21:22


2 Comments

root Over a year ago

No need for the lambda, .apply(sorted, axis=1) will work.

piRSquared Over a year ago

I love this answer! Everything I thought up consisted of stacking two dataframes. This cleverness eliminates that need.

MaxU - stand with Ukraine · Accepted Answer · 2016-11-07 21:39:17Z

14

Here is a slightly uglier, but faster, solution:

In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
    A   B
0   0  50
1  10  22
2  11  35
3   5  21

Timing for an 8K-row DataFrame:

In [50]: big = pd.concat([data] * 10**3, ignore_index=True)

In [51]: big.shape
Out[51]: (8000, 2)

In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
1 loop, best of 3: 3.04 s per loop

In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
100 loops, best of 3: 3.96 ms per loop

In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
1 loop, best of 3: 2.69 s per loop
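
If the frame has other columns that should survive, the same `np.sort` idea can be used only to build a duplicate mask and then filter the original rows. This is a sketch under assumptions: the helper name `drop_reversed_duplicates` and the extra `weight` column are made up for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

def drop_reversed_duplicates(df, cols=('A', 'B')):
    """Hypothetical helper: drop rows whose `cols` values repeat in any order."""
    cols = list(cols)
    # Sort the values within each row so (0, 50) and (50, 0) look identical.
    sorted_vals = np.sort(df[cols].to_numpy(), axis=1)
    # Flag every row whose sorted values were already seen further up.
    is_dup = pd.DataFrame(sorted_vals, index=df.index).duplicated()
    # Filter the original frame so other columns and the row order are kept.
    return df.loc[~is_dup]

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0],
                     'weight': range(8)})  # made-up extra column
kept = drop_reversed_duplicates(data)
print(kept)
```

Unlike the one-liner, this keeps each surviving row in its original orientation (row 3 stays 21, 5), since the sorted values are used only for comparison.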

edited Nov 7, 2016 at 21:39

answered Nov 7, 2016 at 21:30


1 Comment

piRSquared Over a year ago

That is the same answer with a vectorized implementation. NOT uglier! :-)

Farah Nazifa · Accepted Answer · 2020-06-09 08:12:29Z

0

Now this solution works:

data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()

More columns can be added as needed, e.g.:

data.set_index(['A','B', 'C']).stack().drop_duplicates().unstack().reset_index()

answered Jun 9, 2020 at 8:12


mohamed banihani · Accepted Answer · 2021-07-01 20:17:44Z

0

df.T.apply(sorted).T.drop_duplicates()

answered Jul 1, 2021 at 20:17


Vinay · Accepted Answer · 2021-07-25 20:38:09Z

0

Here is a somewhat lengthy solution, but it might be helpful for beginners.

Create new columns that hold the smaller and the larger value of columns A and B in each row:

data['C'] = np.where(data['A']<data['B'] , data['A'], data['B'])
data['D'] = np.where(data['A']>data['B'] , data['A'], data['B'])

Remove the duplicates, sort by column 'C' as requested in the question, and rename the columns:

data2 = data[['C', 'D']].drop_duplicates().sort_values('C')
data2.columns = ['A', 'B']   
data2

P.S. The np.where function works like the IF formula in Excel: (logical condition, value if TRUE, value if FALSE).
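
The same row-wise minimum/maximum idea can also be written without adding helper columns to `data`. A minimal sketch, assuming only the two numeric columns A and B from the question; the names `low` and `high` are chosen here for illustration:

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Row-wise smaller and larger value of the pair.
low = data[['A', 'B']].min(axis=1)
high = data[['A', 'B']].max(axis=1)

# Rebuild the frame with the smaller value in A, then deduplicate and sort.
data2 = (pd.DataFrame({'A': low, 'B': high})
           .drop_duplicates()
           .sort_values('A')
           .reset_index(drop=True))
print(data2)
```

This leaves the original `data` untouched, which may be preferable when the frame is reused later.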

answered Jul 25, 2021 at 20:38


mozway · Accepted Answer · 2022-12-08 10:14:46Z

Another classical option is to aggregate the values of each row as a frozenset and to use boolean indexing:

out = data[~data[['A', 'B']].agg(frozenset, axis=1).duplicated()]

Output:

    A   B
0   0  50
1  10  22
2  11  35
3  21   5

It's also fairly efficient, although not as much as the very optimized np.sort approach:

%timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
27.2 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
733 µs ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit big.apply(np.sort, axis = 1).drop_duplicates()
12 s ± 403 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit big[~big[['A', 'B']].agg(frozenset, axis=1).duplicated()]
25 ms ± 657 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
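
One thing to keep in mind with the frozenset key: it ignores both order and multiplicity. That is harmless for two columns, but with three or more columns rows such as (1, 1, 2) and (1, 2, 2) would collapse to the same key. A small sketch of the idea and of that caveat; building the keys with a list comprehension and using a sorted tuple as the alternative key are choices made here for illustration, not part of the answer above:

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# One order-insensitive, hashable key per row.
keys = pd.Series([frozenset(pair) for pair in zip(data['A'], data['B'])],
                 index=data.index)
# Keep only the first row seen for each key.
out = data[~keys.duplicated()]
print(out)

# Caveat for three or more columns: sets drop repeated values,
# while a sorted tuple keeps them.
a, b = (1, 1, 2), (1, 2, 2)
same_as_sets = frozenset(a) == frozenset(b)
same_as_sorted = tuple(sorted(a)) == tuple(sorted(b))
print(same_as_sets, same_as_sorted)
```

If repeated values within a row matter, the sorted-tuple key (or the `np.sort` approach) is the safer choice.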
