16

I have a data frame with two columns, A and B. The order of A and B is unimportant in this context; for example, I would consider (0,50) and (50,0) to be duplicates. In pandas, what is an efficient way to remove these duplicates from a dataframe?

import pandas as pd

# Initial data frame.
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50], 
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
data
    A   B
0   0  50
1  10  22
2  11  35
3  21   5
4  22  10
5  35  11
6   5  21
7  50   0

# Desired output with "duplicates" removed. 
data2 = pd.DataFrame({'A': [0, 5, 10, 11], 
                      'B': [50, 21, 22, 35]})
data2
    A   B
0   0  50
1   5  21
2  10  22
3  11  35

Ideally, the output would be sorted by values of column A.

6 Answers

15

You can sort each row of the data frame before dropping the duplicates:

data.apply(lambda r: sorted(r), axis = 1).drop_duplicates()

#   A    B
#0  0   50
#1  10  22
#2  11  35
#3  5   21

If you prefer the result to be sorted by column A:

data.apply(lambda r: sorted(r), axis = 1).drop_duplicates().sort_values('A')

#   A    B
#0  0   50
#3  5   21
#1  10  22
#2  11  35
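
On more recent pandas releases, a row-wise apply whose function returns a list gives back a Series of lists rather than a DataFrame, so the drop_duplicates call above may fail with an "unhashable type" error. A minimal sketch of one way around that, assuming a pandas version that supports the result_type argument (0.23 or later, as far as I know); the names sorted_rows and result are only for illustration:

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Sort the values within each row. result_type='broadcast' keeps the
# original index and column labels instead of returning a Series of lists.
sorted_rows = data.apply(lambda r: sorted(r), axis=1, result_type='broadcast')

# Rows that differed only in order are now identical, so they can be dropped.
result = sorted_rows.drop_duplicates().sort_values('A')
print(result)
```

This keeps the idea of the answer unchanged; only the shape of the apply result is pinned down explicitly.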

2 Comments

No need for the lambda, .apply(sorted, axis=1) will work.
I love this answer! Everything I thought up consisted of stacking two dataframes. This cleverness eliminates that need.
14

Here is a slightly uglier but faster solution:

In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
    A   B
0   0  50
1  10  22
2  11  35
3   5  21

Timing for an 8,000-row DataFrame:

In [50]: big = pd.concat([data] * 10**3, ignore_index=True)

In [51]: big.shape
Out[51]: (8000, 2)

In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
1 loop, best of 3: 3.04 s per loop

In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
100 loops, best of 3: 3.96 ms per loop

In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
1 loop, best of 3: 2.69 s per loop
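
For anyone who wants to run this outside an IPython session, here is a self-contained sketch of the same np.sort idea, with the imports included and the final ordering by column A that the question asks for. The variable names normalized and result are my own, and to_numpy() is used in place of .values on the assumption of a reasonably recent pandas:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# np.sort with axis=1 sorts the values inside each row of the 2-D array,
# so (50, 0) and (0, 50) both become (0, 50).
normalized = pd.DataFrame(np.sort(data.to_numpy(), axis=1),
                          columns=data.columns)

# Drop the repeated rows, order by A, and renumber the index.
result = normalized.drop_duplicates().sort_values('A').reset_index(drop=True)
print(result)
```

The speed-up in the timings above comes from doing the row sort once on the whole array instead of calling a Python function per row.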

1 Comment

That is the same answer with a vectorized implementation. NOT uglier! :-)
0

This solution works:

data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()

More columns can be added as needed, e.g.:

data.set_index(['A','B', 'C']).stack().drop_duplicates().unstack().reset_index()

0

data.T.apply(sorted).T.drop_duplicates()

0

Here is a somewhat lengthy solution, but it might be helpful for beginners.

Create two new columns that hold the smaller and the larger value of columns A and B for each row:

import numpy as np

data['C'] = np.where(data['A'] < data['B'], data['A'], data['B'])  # smaller value of the row
data['D'] = np.where(data['A'] > data['B'], data['A'], data['B'])  # larger value of the row

Remove the duplicates, sort by column 'C' as requested in the question, and rename the columns:

data2 = data[['C', 'D']].drop_duplicates().sort_values('C')
data2.columns = ['A', 'B']   
data2

PS: the "np.where" function works like the IF formula in Excel (logical condition, value if TRUE, value if FALSE).
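
The same min/max idea can also be written without adding helper columns to the original frame. This is only a sketch using DataFrame.min and DataFrame.max along axis=1; the names pair, normalized and result are made up for the example:

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

pair = data[['A', 'B']]

# Put the smaller value of each row in A and the larger one in B,
# leaving the original data frame untouched.
normalized = pd.DataFrame({'A': pair.min(axis=1), 'B': pair.max(axis=1)})

# Drop repeated rows, order by A, and renumber the index.
result = normalized.drop_duplicates().sort_values('A').reset_index(drop=True)
print(result)
```

With only two columns, taking the minimum and maximum is equivalent to sorting each row, which is why this gives the same result as the sorting-based answers.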

0

Another classical option is to aggregate the values of each row into a frozenset and use boolean indexing to keep only the first occurrence of each set:

out = data[~data[['A', 'B']].agg(frozenset, axis=1).duplicated()]

Output:

    A   B
0   0  50
1  10  22
2  11  35
3  21   5

It's also fairly efficient, although not as fast as the highly optimized np.sort approach:

%timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
27.2 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
733 µs ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit big.apply(np.sort, axis = 1).drop_duplicates()
12 s ± 403 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit big[~big[['A', 'B']].agg(frozenset, axis=1).duplicated()]
25 ms ± 657 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
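
To make the mechanism explicit, here is a small sketch of why the frozenset key works: frozensets compare equal regardless of element order and are hashable, so duplicated() can spot the repeats. The intermediate names keys and mask are mine, added only for readability:

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Order does not matter for set equality, and frozensets can be hashed.
assert frozenset((0, 50)) == frozenset((50, 0))

# One frozenset per row serves as an order-insensitive key.
keys = data[['A', 'B']].agg(frozenset, axis=1)

# duplicated() marks every repeat of a key after its first occurrence.
mask = ~keys.duplicated()

# Boolean indexing keeps the first row of each pair exactly as it was stored.
out = data[mask]
print(out)
```

Note that the surviving rows keep their original orientation, so the pair 21/5 stays as (21, 5) rather than (5, 21). Also, a set drops repeated values, which is harmless with two columns but means that with three columns rows such as (1, 1, 2) and (1, 2, 2) would be treated as the same.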
