I have a list of words:

wordlist = ['hypothesis' , 'test' , 'results' , 'total']

I have a sentence:

sentence = "These tests will benefit in the long run."

I want to check whether the words in wordlist appear in the sentence. I know I can check whether they are substrings of the sentence using:

for word in wordlist:
    if word in sentence:
        print word

However, with substring matching I get matches for words that are not really in the sentence. For example, test is reported as a match here even though the sentence only contains tests. I could solve this with regular expressions, but is it possible to build the regular expression from each word in turn? That is, if I want to see whether the word is in the sentence, something like:

for some_word_goes_in_here in wordlist:
    if re.search('.*(some_word_goes_in_here).*', sentence):
        print some_word_goes_in_here

In this case the regular expression treats some_word_goes_in_here as the literal text to search for, not as the value of the variable some_word_goes_in_here. Is there a way to format the pattern so that the regular expression searches for the variable's value instead?

  • If you have a better solution, I am eager to listen to it. Commented Jan 8, 2014 at 10:58

3 Answers

Use \b word boundaries to test for the words:

for word in wordlist:
    if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
        print '{} matched'.format(word)

You could also just split the sentence into separate words. Using a set for the word list makes the test more efficient:

words = set(wordlist)
if words.intersection(sentence.split()):
    # no looping over `words` required.
    print('at least one word matched')

Demo:

>>> import re
>>> wordlist = ['hypothesis' , 'test' , 'results' , 'total']
>>> sentence = "These tests will benefit in the long run."
>>> for word in wordlist:
...     if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
...         print '{} matched'.format(word)
... 
>>> words = set(wordlist)
>>> words.intersection(sentence.split())
set([])
>>> sentence = 'Lets test this hypothesis that the results total the outcome'
>>> for word in wordlist:
...     if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
...         print '{} matched'.format(word)
... 
hypothesis matched
test matched
results matched
total matched
>>> words.intersection(sentence.split())
set(['test', 'total', 'hypothesis', 'results'])
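
One caveat with the split-based test: str.split() only breaks on whitespace, so punctuation stays attached to its neighbour ('run.' rather than 'run'). Below is a minimal sketch of one possible workaround, written for Python 3 and assuming that a run of \w characters is an acceptable definition of a word for your text. The helper name words_in_sentence and the extra 'run' entry in the word list are made up for illustration and are not part of the question.

```python
import re

def words_in_sentence(wordlist, sentence):
    """Return the subset of wordlist that occurs as whole words in sentence."""
    # \w+ grabs runs of letters, digits and underscores, so trailing
    # punctuation such as the '.' in 'run.' is dropped from each token.
    tokens = set(re.findall(r'\w+', sentence))
    return set(wordlist) & tokens

# 'run' is a hypothetical extra entry, added only to show the punctuation case.
wordlist = ['hypothesis', 'test', 'results', 'total', 'run']
sentence = "These tests will benefit in the long run."

print(words_in_sentence(wordlist, sentence))        # tokenised with \w+
print(set(wordlist) & set(sentence.split()))        # plain whitespace split
```

The first call finds 'run', while the plain split misses it because its token is 'run.'. Note that \w+ is still a rough tokenizer: a contraction such as "Let's" comes out as two tokens, so this only narrows the gap rather than closing it.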

2 Comments

I was considering using re.escape and decided against it, since these words don't need escaping. In the more general case it is good advice.
@MartijnPieters I think splitting the sentence into words could introduce errors, as finding the boundaries between words is not a trivial task.
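
To illustrate the re.escape point from the first comment, here is a small sketch (Python 3) of what can go wrong without it. The entry '3.5' and the sentence are hypothetical, chosen only because '.' is a regex metacharacter. None of the words in the question contain one.

```python
import re

word = '3.5'                                  # hypothetical entry with a metacharacter
sentence = 'Build 345 was released today.'    # hypothetical sentence

unescaped = re.search(r'\b' + word + r'\b', sentence)
escaped = re.search(r'\b{}\b'.format(re.escape(word)), sentence)

# Unescaped, '.' matches any character, so the pattern wrongly matches '345'.
print(unescaped.group(0) if unescaped else None)
# Escaped, only a literal '3.5' can match, and there is none in this sentence.
print(escaped)
```

For plain alphabetic words the two patterns behave the same, which is why escaping is optional for the list in the question.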

Try using:

if re.search(r'\b' + word + r'\b', sentence):

\b is a word boundary: it matches at the position between a word character and a non-word character, or at the start or end of the string (a word character is any letter, digit or underscore).

For instance,

>>> import re
>>> wordlist = ['hypothesis' , 'test' , 'results' , 'total']
>>> sentence = "The total results for the test confirm the hypothesis"
>>> for word in wordlist:
...     if re.search(r'\b' + word + r'\b', sentence):
...         print word
...
hypothesis
test
results
total

With your string:

>>> sentence = "These tests will benefit in the long run."
>>> for word in wordlist:
...     if re.search(r'\b' + word + r'\b', sentence):
...         print word
...
>>>

Nothing is printed, because test only occurs there as part of tests.
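
The word-character rule above (letters, digits, underscore) can be checked directly. This is just an illustrative sketch for Python 3. The sample text is made up to cover a plural, an underscore and trailing punctuation.

```python
import re

text = 'test tests test_case test.'   # hypothetical sample text

# \b only matches where a word character (letter, digit, underscore)
# meets a non-word character or the start/end of the string.
spans = [m.span() for m in re.finditer(r'\btest\b', text)]
print(spans)
```

Only the first and last occurrences match: in tests the next character is a letter, and in test_case it is an underscore, so neither position is a word boundary. The trailing full stop is a non-word character, so the final test still matches.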

5 Comments

@kolonel I used a different string, but let me put yours in a bit
@MartijnPieters I should have changed it ^^;. Did it now. Thanks!
@Jerry why does it fail if I replace r'\b' with '\b'?
@kolonel It's because in an ordinary string literal '\b' is the backspace character. You have to pass the regex a raw string, r'\b' (or double the backslashes: re.search('\\b' + word + '\\b', sentence)), for it to mean a word boundary.
@kolonel You can call me 'Jerry'. You're welcome! ^^
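
The raw-string point in the comments above can be verified with a few lines. This is a small sketch for Python 3, reusing the example sentence from this answer. The variable names plain and raw are only for illustration.

```python
import re

plain = '\b'    # one character: backspace (ASCII 8)
raw = r'\b'     # two characters: a backslash followed by 'b'

print(len(plain), len(raw))
print(plain == '\x08')

sentence = 'The total results for the test confirm the hypothesis'
# With the plain literal, the regex looks for literal backspace characters.
print(re.search('\b' + 'test' + '\b', sentence))
# With the raw literal, it looks for word boundaries around 'test'.
print(re.search(r'\b' + 'test' + r'\b', sentence) is not None)
```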

I'd use this:

import re

words = "hypothesis test results total".split()
# ^^^ but you can use your literal list if you prefer that
for word in words:
    if re.search(r'\b%s\b' % (word,), sentence):
        print word

You can even speed this up by using a single regexp:

for foundWord in re.findall(r'\b' + r'\b|\b'.join(words) + r'\b', sentence):
    print foundWord
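
If the same word list is checked against many sentences, the single-regexp idea can be taken a little further by escaping each word and compiling the pattern once. The following is a sketch under those assumptions, written for Python 3. The helper name compile_word_pattern is hypothetical.

```python
import re

def compile_word_pattern(words):
    """Build one regex that matches any of the given words as a whole word.

    Assumes `words` is non-empty; an empty list would yield a pattern
    that matches the empty string at every word boundary.
    """
    # Escape each word so metacharacters are taken literally, and wrap the
    # alternation in a non-capturing group so \b applies to every branch.
    alternation = '|'.join(re.escape(w) for w in words)
    return re.compile(r'\b(?:' + alternation + r')\b')

words = ['hypothesis', 'test', 'results', 'total']
pattern = compile_word_pattern(words)

found = pattern.findall('The total results for the test confirm the hypothesis')
missing = pattern.findall('These tests will benefit in the long run.')
print(found)
print(missing)
```

findall returns the matches in the order they occur in the sentence, and an empty list when nothing matches, so the result can be used directly in an if test.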
