
In Python 2.7, given this string:

Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.

what would be the best way to find the total number of occurrences of "Spot", "brown", and "hair" in the string? In the example, it would return 8.

I'm looking for something like string.count("Spot","brown","hair"), but one that works with the "strings to be found" in a tuple or list.

Thanks!

  • Do you want to count "hair" in "hairy"? The nltk answer does not count it, while the count() and the regular expressions answers do. Commented Mar 19, 2013 at 1:11
  • It's easy to exclude that with regex by adding word boundaries (\b). Commented Mar 19, 2013 at 1:13
  • Indeed, but this changes your answer. :) Commented Mar 19, 2013 at 1:21
  • Currently, my "strings to be found" is complicated enough to not get multiple hits like that, but I appreciate all the regex info and tips in case I ever have to come back. Also helps future googlers :D Commented Mar 19, 2013 at 1:28
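To illustrate the word-boundary point raised in the comments, here is a minimal sketch (the sample sentence is invented for demonstration):

```python
import re

s = "Spot is a brown hairy dog."

# Without boundaries, "hair" also matches inside "hairy"
print(len(re.findall(r"hair", s)))        # 1

# With \b word boundaries, only the whole word "hair" counts
print(len(re.findall(r"\bhair\b", s)))    # 0
```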

2 Answers


This does what you asked for, but notice that it will also count words like "hairy", "browner" etc.

>>> s = "Spot is a brown dog. Spot has brown hair. The hair of Spot is brown."
>>> sum(s.count(x) for x in ("Spot", "brown", "hair"))
8

You can also write it using map:

>>> sum(map(s.count, ("Spot", "brown", "hair")))
8

A more robust solution might use the nltk package

>>> import nltk  # Natural Language Toolkit
>>> sum(x in {"Spot", "brown", "hair"} for x in nltk.wordpunct_tokenize(s))
8
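If a per-word breakdown is wanted rather than just the total, a Counter over the tokens works. Here is a stdlib-only sketch that uses a simple `\w+` regex as a stand-in tokenizer, since nltk may not be installed (the regex tokenizer is an assumption, not nltk's behavior):

```python
import re
from collections import Counter

s = "Spot is a brown dog. Spot has brown hair. The hair of Spot is brown."
targets = {"Spot", "brown", "hair"}

# \w+ is a crude substitute for nltk.wordpunct_tokenize
tokens = re.findall(r"\w+", s)
counts = Counter(t for t in tokens if t in targets)

print(counts)                # per-word counts
print(sum(counts.values()))  # total: 8
```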

4 Comments

I wasn't going to say anything about nltk since I don't know that package -- I would +1 again for that one if I could.
+1 for the nltk option, which does not count "hair" in "hairy"—in case this is what the original poster wants.
The nltk option is asymptotically faster than the count() one, since it only reads the input string once, and since the membership test is done in constant time.
How can one get the unique count of each phrase like "green spot", "long brown tail", "red hair", etc. and display the results in a table?

I might use a Counter:

s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'
words_we_want = ("Spot","brown","hair")
from collections import Counter
data = Counter(s.split())
print(sum(data[word] for word in words_we_want))

Note that this will under-count by 2, since 'brown.' and 'hair.' (with the trailing periods) are separate Counter entries from 'brown' and 'hair'.
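One way to avoid that under-count while keeping the Counter approach is to strip leading and trailing punctuation from each token before counting (a sketch):

```python
import string
from collections import Counter

s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'
words_we_want = ("Spot", "brown", "hair")

# strip() with string.punctuation turns 'brown.' into 'brown', etc.
data = Counter(w.strip(string.punctuation) for w in s.split())
print(sum(data[word] for word in words_we_want))  # 8
```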

A slightly less elegant solution that doesn't trip up on punctuation uses a regex:

>>> import re
>>> len(re.findall('Spot|brown|hair','Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'))
8

You can create the regex from a tuple simply by

'|'.join(re.escape(x) for x in words_we_want)
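Putting those two pieces together (re.escape guards against regex metacharacters in the search terms):

```python
import re

s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'
words_we_want = ("Spot", "brown", "hair")

# Build the alternation pattern from the tuple, escaping each term
pattern = '|'.join(re.escape(x) for x in words_we_want)
print(len(re.findall(pattern, s)))  # 8
```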

The nice thing about these solutions is that they have better algorithmic complexity than the solution by gnibbler: they scan the string once rather than once per search term. Of course, which one actually performs better on real-world data can only be measured by the OP, since only the OP has the real-world data.

2 Comments

And I suppose, with the regex, you could evaluate this lazily via re.finditer + the old standby sum(1 for _ in ...) idiom.
+1 for finditer() and regexps in general: they are fast, for larger strings and number of possible words.
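The lazy variant mentioned in the comments, for reference (a sketch): re.finditer yields match objects one at a time instead of building a list, and sum(1 for _ in ...) counts them without storing them.

```python
import re

s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'

# finditer is lazy; this never materializes the full list of matches
total = sum(1 for _ in re.finditer('Spot|brown|hair', s))
print(total)  # 8
```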
