6

I have the following simple DataFrame:

import pandas as pd
df = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                   'column_b': ['b', 'x', 'y', 'c', 'z']})


  column_a column_b
0        a        b
1        b        x
2        c        y
3        d        c
4        e        z

I'm looking to display the strings which occur in both columns:

result = ("b", "c")

Thanks

3
  • 3
    Can you use: np.intersect1d(df.column_a, df.column_b) ? Commented Nov 21, 2018 at 15:48
  • 1
    @JonClements please add that as an answer. That is useful and needs to be voted upon. Commented Nov 21, 2018 at 15:52
  • @piRSquared well, depending if multiple occurrences need preserving or ordering shouldn't be sorted etc... it's an option, but I'm fairly sure there's a canonical post for this somewhere... Commented Nov 21, 2018 at 15:55

7 Answers

7

intersection

This generalizes over any number of columns.

set.intersection(*map(set, map(df.get, df)))

{'b', 'c'}
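
To illustrate the "any number of columns" point, here is a minimal sketch with a third column. The frame `df3` and its `column_c` are hypothetical and not part of the question; they are only there to show that the same expression works unchanged.

```python
import pandas as pd

# Hypothetical three-column frame; column_c is added purely for illustration.
df3 = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                    'column_b': ['b', 'x', 'y', 'c', 'z'],
                    'column_c': ['c', 'b', 'q', 'r', 'b']})

# Iterating a DataFrame yields its column labels, so map(df3.get, df3)
# produces one Series per column. Each Series becomes a set, and
# set.intersection keeps only the values present in every column.
common = set.intersection(*map(set, map(df3.get, df3)))
print(sorted(common))
```

Note that the result is a set, so any ordering or duplicate information from the columns is lost.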

8 Comments

Alternatively: set(df.iloc[:, 0]).intersection(*df.iloc[:, 1:]) Or maybe just: set(df.iloc[:, 0]).intersection(*df.values.T) as it doesn't matter if you compare the first set to itself really... (bit of a waste but...) - does mean you can pick the column more easily that's probably got the most unique values to reduce comparisons...
I'd personally go for: reduce(np.intersect1d, df.values.T) though :)
couple of things: 1: map(df.get, df) should be faster than df.values.T in a general sense. (I'd have to commit to some testing if I wanted more conviction) 2: (lambda v: set(v[0]).intersection(*v[1:]))(df.values.T)
Depends on how efficient np.intersect1d is though...?
Yeah, I need to play with that.
5

Use python's set object:

in_a = set(df.column_a)
in_b = set(df.column_b)
in_both = in_a.intersection(in_b)
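
Since the question asks for a tuple like `("b", "c")`, one possible way to get there from the set is sketched below. Sorting is my assumption about the desired order; sets themselves are unordered.

```python
import pandas as pd

df = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                   'column_b': ['b', 'x', 'y', 'c', 'z']})

# intersection() accepts any iterable, so the second column
# does not need to be converted to a set first.
in_both = set(df.column_a).intersection(df.column_b)

# Sets have no defined order; sort before converting so the
# tuple is reproducible from run to run.
result = tuple(sorted(in_both))
print(result)  # ('b', 'c')
```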

2 Comments

Works as a one liner too: result = set(df.column_a).intersection(df.column_b)
This is indeed great for the use case the OP asked about, where the columns are known in advance.
4

Similar to Sandeep Kadapa's solution. (Without tolist and loc.)

>>> tuple(df['column_a'][df['column_a'].isin(df['column_b'])])                                            
('b', 'c')
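
One difference from the set-based answers is worth spelling out: `isin` keeps the order of `column_a` and also keeps repeated values. A small sketch, using a hypothetical frame `dup` with a repeated value (not the data from the question):

```python
import pandas as pd

# Hypothetical data with a repeated 'b' in column_a, for illustration only.
dup = pd.DataFrame({'column_a': ['c', 'b', 'b', 'a'],
                    'column_b': ['b', 'c', 'x', 'y']})

# Boolean mask: True where the column_a value also appears in column_b.
mask = dup['column_a'].isin(dup['column_b'])

# Order of column_a is preserved and the repeated 'b' shows up twice.
kept = tuple(dup['column_a'][mask])

# unique() drops the repeats while keeping first-occurrence order.
deduped = tuple(dup['column_a'][mask].unique())

print(kept)     # ('c', 'b', 'b')
print(deduped)  # ('c', 'b')
```

Whether the repeats are wanted depends on the use case; the question's example has none, so both approaches give the same values there.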


2

Data

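import numpy as np
import pandas as pd
from functools import reduce

n = int(10e3)  # must be an int: a list cannot be repeated by a float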

ints = pd.DataFrame({'column_a': [1, 2, 3, 4, 5] * n,
                   'column_b': [2, 10, 9, 3, 8] * n})

strings = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'] * n,
                   'column_b': ['b', 'x', 'y', 'c', 'z'] * n})

Methods

def using_isin(df):  # @timgeb
    return df['column_a'][df['column_a'].isin(df['column_b'])]

def using_isin_loc_tolist(df):  # @SandeepKadapa
    return df.loc[df['column_a'].isin(df['column_b'].tolist()),'column_a']

def using_melt_groupby(df):  # @W-B
    return df.melt().groupby('value').variable.nunique().loc[lambda x : x>1].index

def using_set_intersection(df):  # @GergesDib, @TBurgins
    return set(df['column_a']).intersection(set(df['column_b']))

def using_set_intersection_map(df):  # @piRSquared
    return set.intersection(*map(set, map(df.get, df)))

def using_reduce_np_intersect(df):  # @JonClements
    return reduce(np.intersect1d, df.values.T)

def using_np_any(df):  # @W-B
    return df.column_a[np.any(df['column_a'].values == df['column_b'].values[:, None], 0)]

Performance if columns contain ints

%timeit -n 10 using_isin(ints)
977 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_reduce_np_intersect(ints)
1.31 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_set_intersection(ints)
1.54 ms ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_set_intersection_map(ints)
1.59 ms ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_isin_loc_tolist(ints)
2.39 ms ± 921 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_melt_groupby(ints)
34.2 ms ± 988 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_np_any(ints)
4.35 s ± 148 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Performance if columns contain strings

%timeit -n 10 using_set_intersection_map(strings)
1.16 ms ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_set_intersection(strings)
1.2 ms ± 71.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_isin(strings)
1.69 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_isin_loc_tolist(strings)
2.15 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_melt_groupby(strings)
35.6 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_reduce_np_intersect(strings)
43 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_np_any(strings)
# too slow to count
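
For anyone who wants to rerun a comparison like this outside IPython, here is a rough sketch using the standard `timeit` module on three of the methods above. The smaller `n` and the `repeat`/`number` settings are my own assumptions to keep it quick, and `to_numpy(dtype=object)` is used in place of `.values` so the array type does not depend on the pandas version. Timings will differ by machine and library version, so no numbers are shown.

```python
import timeit
from functools import reduce

import numpy as np
import pandas as pd

n = 1_000  # assumption: smaller than the n used above, to keep the run short
strings = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'] * n,
                        'column_b': ['b', 'x', 'y', 'c', 'z'] * n})

def using_set_intersection(df):
    return set(df['column_a']).intersection(set(df['column_b']))

def using_isin(df):
    return df['column_a'][df['column_a'].isin(df['column_b'])]

def using_reduce_np_intersect(df):
    # One row per column after transposing; intersect them pairwise.
    return reduce(np.intersect1d, df.to_numpy(dtype=object).T)

for fn in (using_set_intersection, using_isin, using_reduce_np_intersect):
    # Best of 3 repeats, 10 calls each, reported as time per call.
    best = min(timeit.repeat(lambda: fn(strings), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.3f} ms per call")
```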

1 Comment

np I was trying to figure out why using np.arrays didn't really matter in this case so was doing them anyway.
1

Use isin and tuple as:

tuple(df.loc[df['column_a'].isin(df['column_b'].tolist()),'column_a'])
('b', 'c')


1

This is conceptually the same (using sets) as the posted answers, but I feel it is a little simpler:

set(df.column_a) & set(df.column_b)


1

Using melt

df.melt().groupby('value').variable.nunique().loc[lambda x : x>1].index
Out[79]: Index(['b', 'c'], dtype='object', name='value')
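
To make the melt approach easier to follow, here is a step-by-step sketch of the same idea. The intermediate names are mine, and the `== df.shape[1]` filter is a variation I am assuming rather than the line above: with more than two columns, `> 1` selects values found in at least two columns, while `== df.shape[1]` selects values found in all of them. For two columns the two filters agree.

```python
import pandas as pd

df = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                   'column_b': ['b', 'x', 'y', 'c', 'z']})

# melt() stacks all columns into long form with two columns:
# 'variable' (the source column name) and 'value' (the cell content).
long_form = df.melt()

# For each distinct value, count how many different columns it came from.
columns_per_value = long_form.groupby('value')['variable'].nunique()

# Keep the values that occur in every column of the frame.
in_all = columns_per_value[columns_per_value == df.shape[1]].index
result = tuple(in_all)
print(result)  # ('b', 'c')
```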

If speed matters

s1 = df['column_a'].values
s2 = df['column_b'].values

df.column_a[np.any(s1 == s2[:, None], 0)]
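
A short sketch of what the broadcasting does, in case it is not obvious. I use `to_numpy(dtype=object)` instead of `.values` here as an assumption, so the arrays are plain NumPy object arrays regardless of pandas version. The comparison builds a full len(column_b) by len(column_a) boolean matrix, so its memory and time cost grow with the product of the two lengths.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                   'column_b': ['b', 'x', 'y', 'c', 'z']})

s1 = df['column_a'].to_numpy(dtype=object)
s2 = df['column_b'].to_numpy(dtype=object)

# s2[:, None] has shape (5, 1); comparing it with s1 of shape (5,)
# broadcasts to a (5, 5) matrix where pairwise[i, j] is True
# when column_b[i] == column_a[j].
pairwise = s1 == s2[:, None]

# Collapse over the column_b axis: True for each column_a entry
# that matches at least one column_b entry.
mask = np.any(pairwise, axis=0)

result = tuple(df.column_a[mask])
print(pairwise.shape)  # (5, 5)
print(result)          # ('b', 'c')
```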

5 Comments

Kind of along the same lines: df.melt().value.value_counts()[lambda v: v > 1].index
Why would you use this seeing that it is slower than slow (like 30 times slower) compared to set.intersection(set)?
@user3471881 I just provided a different idea; it also works when there are more than two columns.
I get that it works for multiple columns. But we can accomplish that in better ways and this is soooo expensive.
df.column_a[np.any(...)] is even slower, at least when n = 10e3
