Count instances of strings in multiple columns python

Question

I have the following simple data frame

import pandas as pd
df = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                   'column_b': ['b', 'x', 'y', 'c', 'z']})


  column_a column_b
0        a        b
1        b        x
2        c        y
3        d        c
4        e        z

I'm looking to display the strings which occur in both columns:

result = ("b", "c")

Thanks

@JonClements please add that as an answer. That is useful and needs to be voted upon.
– piRSquared, Commented Nov 21, 2018 at 15:52
@piRSquared well, depending on whether multiple occurrences need preserving, or whether the order shouldn't be sorted, etc., it's an option, but I'm fairly sure there's a canonical post for this somewhere...
– Jon Clements, Commented Nov 21, 2018 at 15:55

piRSquared · Accepted Answer · 2018-11-21 16:07:01Z

7

`intersection`

This generalizes over any number of columns.

set.intersection(*map(set, map(df.get, df)))

{'b', 'c'}
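A minimal sketch of how the one-liner unpacks, assuming a hypothetical third column `column_c` that is not part of the question. Iterating over a DataFrame yields its column labels, `get` turns each label into its column, and `set.intersection` keeps only the values present in every column.

```python
import pandas as pd

# column_c is a hypothetical extra column, added only for illustration
df3 = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                    'column_b': ['b', 'x', 'y', 'c', 'z'],
                    'column_c': ['c', 'q', 'b', 'r', 'b']})

# Iterating over a DataFrame yields column labels; df3.get(label) returns that column
column_sets = [set(df3.get(label)) for label in df3]

# Keep only the values that occur in every column
common = set.intersection(*column_sets)
print(sorted(common))  # ['b', 'c']
```

The list comprehension is just the two nested `map` calls written out, so it should give the same result as the one-liner for any number of columns.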

edited Nov 21, 2018 at 16:07

answered Nov 21, 2018 at 15:51

piRSquared


8 Comments

Jon Clements Over a year ago

Alternatively: set(df.iloc[:, 0]).intersection(*df.iloc[:, 1:]) Or maybe just: set(df.iloc[:, 0]).intersection(*df.values.T) as it doesn't matter if you compare the first set to itself really... (bit of a waste but...) - does mean you can pick the column more easily that's probably got the most unique values to reduce comparisons...

Jon Clements Over a year ago

I'd personally go for: reduce(np.intersect1d, df.values.T) though :)

piRSquared Over a year ago

couple of things: 1: map(df.get, df) should be faster than df.values.T in a general sense. (I'd have to commit to some testing if I wanted more conviction) 2: (lambda v: set(v[0]).intersection(*v[1:]))(df.values.T)

Jon Clements Over a year ago

Depends on how efficient np.intersect1d is though...?

piRSquared Over a year ago

Yeah, I need to play with that.


T Burgis · Accepted Answer · 2018-11-21 15:49:10Z

5

Use python's set object:

in_a = set(df.column_a)
in_b = set(df.column_b)
in_both = in_a.intersection(in_b)
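The question shows the result as a tuple, `("b", "c")`, while a set has no guaranteed order. A small sketch of one way to get a stable tuple, assuming alphabetical order is acceptable (the question does not say which order is wanted):

```python
import pandas as pd

df = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                   'column_b': ['b', 'x', 'y', 'c', 'z']})

in_both = set(df.column_a).intersection(df.column_b)

# Sets are unordered, so sort before converting to get a reproducible tuple
result = tuple(sorted(in_both))
print(result)  # ('b', 'c')
```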

answered Nov 21, 2018 at 15:49

T Burgis


2 Comments

johnpaton Over a year ago

Works as a one liner too: result = set(df.column_a).intersection(df.column_b)

Karn Kumar Over a year ago

This is indeed great for the use case the OP asked about, i.e. comparing specific columns.

timgeb · Accepted Answer · 2018-11-21 15:55:51Z

4

Similar to Sandeep Kadapa's solution. (Without tolist and loc.)

>>> tuple(df['column_a'][df['column_a'].isin(df['column_b'])])
('b', 'c')
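Unlike the set-based answers, the `isin` mask keeps the row order of `column_a` and also keeps repeated values. A sketch with hypothetical data (not from the question) that shows the difference, and one way to drop the repeats if they are unwanted:

```python
import pandas as pd

# Hypothetical data with a repeated value in column_a
dup = pd.DataFrame({'column_a': ['b', 'a', 'b', 'c'],
                    'column_b': ['c', 'b', 'x', 'y']})

# True for every row of column_a whose value also occurs somewhere in column_b
mask = dup['column_a'].isin(dup['column_b'])

with_repeats = tuple(dup['column_a'][mask])
deduplicated = tuple(dup['column_a'][mask].drop_duplicates())

print(with_repeats)   # ('b', 'b', 'c')
print(deduplicated)   # ('b', 'c')
```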

answered Nov 21, 2018 at 15:55

timgeb


user3471881 · Accepted Answer · 2018-11-21 17:14:48Z

Data

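import numpy as np
import pandas as pd
from functools import reduce

n = 10_000  # 10e3 is a float, and multiplying a list by a float raises TypeError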

ints = pd.DataFrame({'column_a': [1, 2, 3, 4, 5] * n,
                   'column_b': [2, 10, 9, 3, 8] * n})

strings = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'] * n,
                   'column_b': ['b', 'x', 'y', 'c', 'z'] * n})

Methods

def using_isin(df):  # @timgeb
    return df['column_a'][df['column_a'].isin(df['column_b'])]

def using_isin_loc_tolist(df):  # @SandeepKadapa
    return df.loc[df['column_a'].isin(df['column_b'].tolist()),'column_a']

def using_melt_groupby(df):  # @W-B
    return df.melt().groupby('value').variable.nunique().loc[lambda x : x>1].index

def using_set_intersection(df):  # @GergesDib, @TBurgis
    return set(df['column_a']).intersection(set(df['column_b']))

def using_set_intersection_map(df):  # @piRSquared
    return set.intersection(*map(set, map(df.get, df)))

def using_reduce_np_intersect(df):  # @JonClements
    return reduce(np.intersect1d, df.values.T)

def using_np_any(df):  # @W-B
    return df.column_a[np.any(df['column_a'].values == df['column_b'].values[:, None], 0)]

Performance if columns contain ints

%timeit -n 10 using_isin(ints)
977 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_reduce_np_intersect(ints)
1.31 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_set_intersection(ints)
1.54 ms ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_set_intersection_map(ints)
1.59 ms ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_isin_loc_tolist(ints)
2.39 ms ± 921 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_melt_groupby(ints)
34.2 ms ± 988 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_np_any(ints)
4.35 s ± 148 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Performance if columns contain strings

%timeit -n 10 using_set_intersection_map(strings)
1.16 ms ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_set_intersection(strings)
1.2 ms ± 71.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_isin(strings)
1.69 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_isin_loc_tolist(strings)
2.15 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_melt_groupby(strings)
35.6 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_reduce_np_intersect(strings)
43 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_np_any(strings)
# too slow to count
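The `%timeit` lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard-library `timeit` module. This is only an illustration: it uses a smaller `n` than the benchmark, covers just two of the methods, and reports the best run rather than the mean and standard deviation, so its numbers are not comparable to the ones listed.

```python
import timeit

import pandas as pd

n = 1_000  # smaller than the benchmark above, to keep the sketch quick
strings = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'] * n,
                        'column_b': ['b', 'x', 'y', 'c', 'z'] * n})

def using_isin(df):
    return df['column_a'][df['column_a'].isin(df['column_b'])]

def using_set_intersection(df):
    return set(df['column_a']).intersection(set(df['column_b']))

timings = {}
for func in (using_isin, using_set_intersection):
    # repeat=7 and number=10 mirror "7 runs, 10 loops each"
    runs = timeit.repeat(lambda: func(strings), repeat=7, number=10)
    timings[func.__name__] = min(runs) / 10  # best per-call time in seconds

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```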

np. I was trying to figure out why using np.arrays didn't really matter in this case, so I was doing them anyway.

Space Impact · Accepted Answer · 2018-11-21 15:51:00Z

1

Use isin and tuple as:

tuple(df.loc[df['column_a'].isin(df['column_b'].tolist()),'column_a'])
('b', 'c')

answered Nov 21, 2018 at 15:51

Space Impact


Gerges · Accepted Answer · 2018-11-21 15:57:58Z

1

This is essentially the same concept (using sets) as the posted answers, but I feel it is a little simpler:

set(df.column_a) & set(df.column_b)

answered Nov 21, 2018 at 15:57

Gerges


BENY · Accepted Answer · 2018-11-21 16:39:57Z

1

Using melt

df.melt().groupby('value').variable.nunique().loc[lambda x : x>1].index
Out[79]: Index(['b', 'c'], dtype='object', name='value')

If speed matters

import numpy as np

s1 = df['column_a'].values
s2 = df['column_b'].values

df.column_a[np.any(s1 == s2[:, None], 0)]
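A sketch of what the broadcast comparison builds, which may help explain the timings reported elsewhere in this thread. Comparing an array of length n against a column vector of length n produces an n by n boolean matrix, so the work and memory grow quadratically; for the benchmark's 50,000 rows that would be 2.5 billion comparisons. The use of `to_numpy(dtype=object)` instead of `.values` is an assumption made here so the arrays are plain NumPy object arrays regardless of the pandas string dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                   'column_b': ['b', 'x', 'y', 'c', 'z']})

# Plain NumPy object arrays (assumption: chosen to avoid extension-array indexing)
s1 = df['column_a'].to_numpy(dtype=object)
s2 = df['column_b'].to_numpy(dtype=object)

# s2[:, None] has shape (5, 1); comparing it with s1 of shape (5,) broadcasts to (5, 5)
matches = s1 == s2[:, None]
print(matches.shape)  # (5, 5)

# Column j contains a True if s1[j] appears anywhere in s2
mask = np.any(matches, axis=0)
print(tuple(s1[mask]))  # ('b', 'c')
```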

edited Nov 21, 2018 at 16:39

answered Nov 21, 2018 at 16:03

BENY


5 Comments

Jon Clements Over a year ago

Kind of along the same lines: df.melt().value.value_counts()[lambda v: v > 1].index

user3471881 Over a year ago

Why would you use this seeing that it is slower than slow (like 30 times slower) compared to set.intersection(set)?

BENY Over a year ago

@user3471881 I just provided a different idea; also, this works when there are more than two columns.

user3471881 Over a year ago

I get that it works for multiple columns. But we can accomplish that in better ways and this is soooo expensive.

user3471881 Over a year ago

df.column[np.any(...)] is even slower, at least when n = 10e3
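One caveat about the `value_counts` variant suggested in the comments above: it counts total occurrences, so a value repeated inside a single column would also pass the `> 1` filter. The `groupby(...).nunique()` version counts distinct columns instead. A sketch with hypothetical data that shows the difference:

```python
import pandas as pd

# Hypothetical data: 'a' repeats inside column_a but never appears in column_b
dup = pd.DataFrame({'column_a': ['a', 'a', 'b'],
                    'column_b': ['b', 'x', 'y']})

melted = dup.melt()  # long format with 'variable' (column name) and 'value' columns

# Counts every occurrence, regardless of which column it came from
counted = set(melted['value'].value_counts()[lambda v: v > 1].index)

# Counts the number of distinct columns each value appears in
per_column = set(
    melted.groupby('value')['variable'].nunique().loc[lambda x: x > 1].index
)

print(sorted(counted))     # ['a', 'b']
print(sorted(per_column))  # ['b']
```

With the question's data every value is unique within its column, so both variants give the same answer there.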
