
In Python 2.7, given this string:

Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.

what would be the best way to find the total number of occurrences of "Spot", "brown", and "hair" in the string? In the example, it would return 8.

I'm looking for something like string.count("Spot","brown","hair"), but one that works with the "strings to be found" in a tuple or list.

Thanks!

  • Do you want to count "hair" in "hairy"? The nltk answer does not count it, while the count() and the regular expressions answers do. Commented Mar 19, 2013 at 1:11
  • It's easy to exclude that with regex by adding word boundaries (\b). Commented Mar 19, 2013 at 1:13
  • Indeed, but this changes your answer. :) Commented Mar 19, 2013 at 1:21
  • Currently, my "strings to be found" is complicated enough to not get multiple hits like that, but I appreciate all the regex info and tips in case I ever have to come back. Also helps future googlers :D Commented Mar 19, 2013 at 1:28
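To illustrate the word-boundary point raised in the comments, here is a minimal sketch (the sample sentence is invented for demonstration):

```python
import re

s = "Spot is a brown hairy dog."

# Without boundaries, "hair" also matches inside "hairy"
print(len(re.findall(r"hair", s)))        # 1

# With \b word boundaries, only the whole word "hair" counts
print(len(re.findall(r"\bhair\b", s)))    # 0
```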

2 Answers


This does what you asked for, but notice that it will also count words like "hairy", "browner" etc.

>>> s = "Spot is a brown dog. Spot has brown hair. The hair of Spot is brown."
>>> sum(s.count(x) for x in ("Spot", "brown", "hair"))
8

You can also write it using map:

>>> sum(map(s.count, ("Spot", "brown", "hair")))
8

A more robust solution might use the nltk package

>>> import nltk  # Natural Language Toolkit
>>> sum(x in {"Spot", "brown", "hair"} for x in nltk.wordpunct_tokenize(s))
8
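If a per-word breakdown is wanted rather than just the total, a Counter over the tokens works. Here is a stdlib-only sketch that uses a simple `\w+` regex as a stand-in tokenizer, since nltk may not be installed (the regex tokenizer is an assumption, not nltk's behavior):

```python
import re
from collections import Counter

s = "Spot is a brown dog. Spot has brown hair. The hair of Spot is brown."
targets = {"Spot", "brown", "hair"}

# \w+ is a crude substitute for nltk.wordpunct_tokenize
tokens = re.findall(r"\w+", s)
counts = Counter(t for t in tokens if t in targets)

print(counts)                # per-word counts
print(sum(counts.values()))  # total: 8
```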

4 Comments

I wasn't going to say anything about nltk since I don't know that package -- I would +1 again for that one if I could.
+1 for the nltk option, which does not count "hair" in "hairy"—in case this is what the original poster wants.
The nltk option is asymptotically faster than the count() one, since it only reads the input string once, and since the membership test is done in constant time.
How can one get the unique count of each phrase like "green spot", "long brown tail", "red hair", etc. and display the results in a table?

I might use a Counter:

s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'
words_we_want = ("Spot","brown","hair")
from collections import Counter
data = Counter(s.split())
print(sum(data[word] for word in words_we_want))

Note that this will under-count by 2, since 'brown.' and 'hair.' (with the trailing periods) are separate Counter entries from 'brown' and 'hair'.
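One way to avoid that under-count while keeping the Counter approach is to strip leading and trailing punctuation from each token before counting (a sketch):

```python
import string
from collections import Counter

s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'
words_we_want = ("Spot", "brown", "hair")

# strip() with string.punctuation turns 'brown.' into 'brown', etc.
data = Counter(w.strip(string.punctuation) for w in s.split())
print(sum(data[word] for word in words_we_want))  # 8
```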

A slightly less elegant solution that doesn't trip up on punctuation uses a regex:

>>> import re
>>> len(re.findall('Spot|brown|hair','Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'))
8

You can create the regex from a tuple simply by

'|'.join(re.escape(x) for x in words_we_want)
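Putting those two pieces together (re.escape guards against regex metacharacters in the search terms):

```python
import re

s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'
words_we_want = ("Spot", "brown", "hair")

# Build the alternation pattern from the tuple, escaping each term
pattern = '|'.join(re.escape(x) for x in words_we_want)
print(len(re.findall(pattern, s)))  # 8
```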

The nice thing about these solutions is that they have better algorithmic complexity than the solution by gnibbler: they scan the string once rather than once per search term. Of course, which one actually performs better on real-world data can only be measured by the OP, since only the OP has the real-world data.

2 Comments

And I suppose, with the regex, you could evaluate this lazily via re.finditer + the old standby sum(1 for _ in ...) idiom.
+1 for finditer() and regexps in general: they are fast, for larger strings and number of possible words.
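The lazy variant mentioned in the comments, for reference (a sketch): re.finditer yields match objects one at a time instead of building a list, and sum(1 for _ in ...) counts them without storing them.

```python
import re

s = 'Spot is a brown dog. Spot has brown hair. The hair of Spot is brown.'

# finditer is lazy; this never materializes the full list of matches
total = sum(1 for _ in re.finditer('Spot|brown|hair', s))
print(total)  # 8
```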
