
I have a dataframe with a list of sentences in one column and am trying to create a new column equal to the total number of times the strings from a list show up.

For example, the relevant dataframe looks like

book['sentences']
0 The brown dog jumped over the big moon
1 The brown fox slid under the brown log

I'm trying to count the number of times "brown", "over", and "log" show up in each sentence (i.e. the new column would be equal to 2 and 3).

I know I can do this with str.count, but only for one string at a time, and then I would have to add the results up:

book['count_brown'] = book['sentences'].str.count('brown')
book['count_over'] = book['sentences'].str.count('over')
book['count_log'] = book['sentences'].str.count('log')
book['count'] = book['count_brown']+book['count_over']+book['count_log']

My list of strings I am searching for is over 300 words long so even with a loop it doesn't seem optimal. Is there a better way to do this?

3 Answers


Ganky!

lst = ['brown', 'over', 'log']

book['sentences'].str.extractall(
    '({})'.format('|'.join(lst))
).groupby(level=0)[0].value_counts().unstack(fill_value=0)

0  brown  log  over
0      1    0     1
1      2    1     0
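The count column the question asks for (2 and 3) then follows by summing this table across its columns. A small sketch of the full pipeline; note that a sentence with no matches at all would drop out of the `extractall` result, so in general a `reindex`/`fillna(0)` on the original index might be needed:

```python
import pandas as pd

book = pd.DataFrame({'sentences': [
    'The brown dog jumped over the big moon',
    'The brown fox slid under the brown log',
]})
lst = ['brown', 'over', 'log']

# Wide table: one column per word, one row per sentence
counts = book['sentences'].str.extractall(
    '({})'.format('|'.join(lst))
).groupby(level=0)[0].value_counts().unstack(fill_value=0)

# Row-wise total across all matched words
book['count'] = counts.sum(axis=1)
```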

2 Comments

What does Ganky mean?
@cᴏʟᴅsᴘᴇᴇᴅ nasty gross terrible

Similar to piRSquared's solution, but uses get_dummies and sum for the counts.

df
                                sentences
0  The brown dog jumped over the big moon
1  The brown fox slid under the brown log

words = ['brown', 'over', 'log']
# sum(level=0) was removed in newer pandas; groupby(level=0).sum() is equivalent
df = df.sentences.str.extractall('({})'.format('|'.join(words)))\
                 .iloc[:, 0].str.get_dummies().groupby(level=0).sum()
df
   brown  log  over
0      1    0     1
1      2    1     0

If you want row-wise counts of all the words in a single column, just sum along the columns axis.

df.sum(axis=1)
0    2
1    3
dtype: int64 



With the help of nltk's frequency distributions you can do this very easily, i.e.

import nltk
import pandas as pd

lst = ['brown', 'over', 'log']
ndf = (df['sentences'].apply(nltk.tokenize.word_tokenize)
                      .apply(nltk.FreqDist)
                      .apply(pd.Series)[lst]
                      .fillna(0))

Output:

   brown  over  log
0    1.0   1.0  0.0
1    2.0   0.0  1.0

For the row totals:

ndf['count'] = ndf.sum(axis=1)
   brown  over  log  count
0    1.0   1.0  0.0    2.0
1    2.0   0.0  1.0    3.0
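Since `str.count` accepts a regular expression, the 300-word list can also be handled in one vectorised call by joining the words with `|`, which gives the total directly without building an intermediate per-word table. A minimal sketch; `re.escape` is used in case any of the words contain regex metacharacters:

```python
import re
import pandas as pd

book = pd.DataFrame({'sentences': [
    'The brown dog jumped over the big moon',
    'The brown fox slid under the brown log',
]})
words = ['brown', 'over', 'log']

# One alternation pattern instead of one str.count call per word
pattern = '|'.join(re.escape(w) for w in words)
book['count'] = book['sentences'].str.count(pattern)
```

Note this matches substrings, not whole words ("log" would also match inside "logbook"); wrap the pattern in `\b...\b` if whole-word matching is needed.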

