
I have a dataframe with a list of sentences in one column and am trying to create a new column equal to the total number of times the strings from a list show up.

For example, the relevant dataframe looks like

book['sentences']
0 The brown dog jumped over the big moon
1 The brown fox slid under the brown log

I'm trying to count the number of times "brown", "over", and "log" show up in each sentence (i.e. the new column would be equal to 2 and 3).

I know I can do this with str.count, but only for one string at a time, and then I would have to add the results up:

book['count_brown'] = book['sentences'].str.count('brown')
book['count_over'] = book['sentences'].str.count('over')
book['count_log'] = book['sentences'].str.count('log')
book['count'] = book['count_brown']+book['count_over']+book['count_log']

My list of strings I am searching for is over 300 words long so even with a loop it doesn't seem optimal. Is there a better way to do this?

3 Answers


Ganky!

lst = ['brown', 'over', 'log']

book['sentences'].str.extractall(
    '({})'.format('|'.join(lst))
).groupby(level=0)[0].value_counts().unstack(fill_value=0)

0  brown  log  over
0      1    0     1
1      2    1     0
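The count column the question asks for (2 and 3) then follows by summing this table across its columns. A small sketch of the full pipeline; note that a sentence with no matches at all would drop out of the `extractall` result, so in general a `reindex`/`fillna(0)` on the original index might be needed:

```python
import pandas as pd

book = pd.DataFrame({'sentences': [
    'The brown dog jumped over the big moon',
    'The brown fox slid under the brown log',
]})
lst = ['brown', 'over', 'log']

# Wide table: one column per word, one row per sentence
counts = book['sentences'].str.extractall(
    '({})'.format('|'.join(lst))
).groupby(level=0)[0].value_counts().unstack(fill_value=0)

# Row-wise total across all matched words
book['count'] = counts.sum(axis=1)
```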

2 Comments

What does Ganky mean?
@cᴏʟᴅsᴘᴇᴇᴅ nasty gross terrible

Similar to piRSquared's solution, but uses get_dummies and sum for the counts.

df
                                sentences
0  The brown dog jumped over the big moon
1  The brown fox slid under the brown log

words = ['brown', 'over', 'log']
# sum(level=0) was removed in newer pandas; groupby(level=0).sum() is equivalent
df = df.sentences.str.extractall('({})'.format('|'.join(words)))\
                 .iloc[:, 0].str.get_dummies().groupby(level=0).sum()
df
   brown  log  over
0      1    0     1
1      2    1     0

If you want row-wise counts of all the words in a single column, just sum along the columns axis.

df.sum(axis=1)
0    2
1    3
dtype: int64 



With the help of nltk's frequency distributions you can do this very easily, i.e.

import nltk
import pandas as pd

lst = ['brown', 'over', 'log']
ndf = (df['sentences'].apply(nltk.tokenize.word_tokenize)
                      .apply(nltk.FreqDist)
                      .apply(pd.Series)[lst]
                      .fillna(0))

Output:

   brown  over  log
0    1.0   1.0  0.0
1    2.0   0.0  1.0

For the row totals:

ndf['count'] = ndf.sum(axis=1)
   brown  over  log  count
0    1.0   1.0  0.0    2.0
1    2.0   0.0  1.0    3.0
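Since `str.count` accepts a regular expression, the 300-word list can also be handled in one vectorised call by joining the words with `|`, which gives the total directly without building an intermediate per-word table. A minimal sketch; `re.escape` is used in case any of the words contain regex metacharacters:

```python
import re
import pandas as pd

book = pd.DataFrame({'sentences': [
    'The brown dog jumped over the big moon',
    'The brown fox slid under the brown log',
]})
words = ['brown', 'over', 'log']

# One alternation pattern instead of one str.count call per word
pattern = '|'.join(re.escape(w) for w in words)
book['count'] = book['sentences'].str.count(pattern)
```

Note this matches substrings, not whole words ("log" would also match inside "logbook"); wrap the pattern in `\b...\b` if whole-word matching is needed.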

