Formatting regular expressions in Python

Question

I have a list of words

wordlist = ['hypothesis' , 'test' , 'results' , 'total']

I have a sentence

sentence = "These tests will benefit in the long run."

I want to check whether the words in wordlist appear in the sentence. I know I can check whether they are substrings of the sentence using:

for word in wordlist:
    if word in sentence:
        print word

However, with substring matching I get matches for words that are not really in the sentence. For example, test matches here as a substring even though the sentence only contains tests. I could solve my problem with regular expressions, but is it possible to build the regular expression from each new word? That is, if I want to see whether the word is in the sentence:

for some_word_goes_in_here in wordlist:
    if re.search('.*(some_word_goes_in_here).*', sentence):
         print some_word_goes_in_here

In this case the regular expression treats some_word_goes_in_here as the literal text to search for, not as the value of the variable some_word_goes_in_here. Is there a way to format the pattern so that the regular expression searches for the value of some_word_goes_in_here?

If you have a better solution, I am eager to hear it.

kolonel
2014-01-08 10:58:27 +00:00

Martijn Pieters · Accepted Answer · 2014-01-08 11:14:44Z

2

Use \b word boundaries to test for the words:

import re

for word in wordlist:
    if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
        print '{} matched'.format(word)

You could also just split the sentence into separate words. Using a set for the word list makes the test more efficient:

words = set(wordlist)
matched = words.intersection(sentence.split())
if matched:
    # no looping over `words` required.
    print(matched)
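One caveat with sentence.split(): it splits on whitespace only, so punctuation stays attached to the neighbouring word ('run.' rather than 'run'). Below is a minimal sketch of one way around that, assuming a "word" is simply a run of \w characters. The find_words name is made up for this illustration, and the code is written with print() calls so it also runs on Python 3.

```python
import re

def find_words(wordlist, sentence):
    """Return the subset of wordlist that occurs as a whole token in sentence."""
    # \w+ pulls out runs of letters, digits and underscores, so trailing
    # punctuation such as the full stop in "run." is dropped.
    tokens = set(re.findall(r'\w+', sentence))
    return set(wordlist) & tokens

wordlist = ['hypothesis', 'test', 'results', 'total']

print(find_words(wordlist, "These tests will benefit in the long run."))
print(find_words(wordlist, "Did the test pass? See the results."))
```

A plain split() would miss results in the second sentence, because the token it produces is 'results.' with the full stop attached. Like the original, this comparison is case-sensitive.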

Demo:

>>> import re
>>> wordlist = ['hypothesis' , 'test' , 'results' , 'total']
>>> sentence = "These tests will benefit in the long run."
>>> for word in wordlist:
...     if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
...         print '{} matched'.format(word)
... 
>>> words = set(wordlist)
>>> words.intersection(sentence.split())
set([])
>>> sentence = 'Lets test this hypothesis that the results total the outcome'
>>> for word in wordlist:
...     if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
...         print '{} matched'.format(word)
... 
hypothesis matched
test matched
results matched
total matched
>>> words.intersection(sentence.split())
set(['test', 'total', 'hypothesis', 'results'])
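The re.escape call in the loop matters once a search term contains regex metacharacters. Here is a small sketch with a made-up term, '3.14', where an unescaped dot would match any character (written with print() calls so it also runs on Python 3):

```python
import re

term = '3.14'            # hypothetical search term containing a metacharacter
sentence = "The code 3014 is not pi."

# Unescaped, the dot matches any character, so "3014" is a false positive.
unescaped = re.search(r'\b{}\b'.format(term), sentence)

# re.escape turns the dot into a literal "\.", so nothing matches here.
escaped = re.search(r'\b{}\b'.format(re.escape(term)), sentence)

print(unescaped.group() if unescaped else None)
print(escaped.group() if escaped else None)
```

For plain alphabetic words like the ones in wordlist the escaping changes nothing. Note also that \b only behaves as expected when the term itself starts and ends with a word character.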

edited Jan 8, 2014 at 11:14

answered Jan 8, 2014 at 11:00


2 Comments

Alfe Over a year ago

I was considering using re.escape and decided against it, since these words don't need that escaping. In the more general case it is good advice.

kolonel Over a year ago

@MartijnPieters I think splitting the sentence into words could introduce errors, as finding the boundaries between words is not really a trivial task.

Jerry · Answer · 2014-01-08 11:03:59Z

1

Try using:

if re.search(r'\b' + word + r'\b', sentence):

\b is a word boundary: it matches at the position between a word character and a non-word character, or at the start or end of the string (a word character is any letter, digit or underscore).

For instance,

>>> import re
>>> wordlist = ['hypothesis' , 'test' , 'results' , 'total']
>>> sentence = "The total results for the test confirm the hypothesis"
>>> for word in wordlist:
...     if re.search(r'\b' + word + r'\b', sentence):
...         print word
...
hypothesis
test
results
total

With your string:

>>> sentence = "These tests will benefit in the long run."
>>> for word in wordlist:
...     if re.search(r'\b' + word + r'\b', sentence):
...         print word
...
>>>

Nothing is printed, because test only occurs as part of tests, not as a whole word.
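To make the boundary rule concrete, here is a small sketch that runs the same \btest\b pattern against a few sample strings. The strings are made up for illustration, and the code uses print() so it also runs on Python 3.

```python
import re

pattern = re.compile(r'\btest\b')

# Made-up sample strings, chosen to show where \b does and does not match.
samples = [
    "run the test.",      # "." is a non-word character -> boundary -> match
    "These tests pass",   # "s" is a word character -> no boundary -> no match
    "test_case",          # "_" counts as a word character -> no match
    "a unit-test here",   # "-" is a non-word character -> match
]

for text in samples:
    print('{!r}: {}'.format(text, bool(pattern.search(text))))
```

The test_case line is the one that tends to surprise people: because the underscore is a word character, there is no boundary between test and _case.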

edited Jan 8, 2014 at 11:03

answered Jan 8, 2014 at 10:58


5 Comments

Jerry Over a year ago

@kolonel I used a different string, but let me put yours in a bit

Jerry Over a year ago

@MartijnPieters I should have changed it ^^;. Did it now. Thanks!

kolonel Over a year ago

@Jerry why does it fail if I replace r'\b' with '\b'?

Jerry Over a year ago

@kolonel It's because in an ordinary string literal '\b' is the backspace character. You'll have to pass the raw string r'\b' (or else escape the backslash yourself: re.search('\\b' + word + '\\b', sentence)) for the regex to see it as a word boundary.

Jerry Over a year ago

@kolonel You can call me 'Jerry'. You're welcome! ^^
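The raw-string point from the comments above can be checked directly. A short sketch comparing the two literals (print() is used so it also runs on Python 3):

```python
import re

sentence = "The total results for the test confirm the hypothesis"

# In a normal string literal, '\b' is a single backspace character (0x08).
print(len('\b'), len(r'\b'))   # lengths differ: 1 vs 2
print('\b' == '\x08')

# The regex engine therefore looks for literal backspaces around "test"...
print(bool(re.search('\b' + 'test' + '\b', sentence)))

# ...while the raw string hands it the two characters \ and b: a word boundary.
print(bool(re.search(r'\b' + 'test' + r'\b', sentence)))
print(bool(re.search('\\b' + 'test' + '\\b', sentence)))
```

The first search finds nothing because the sentence contains no backspace characters. The last two are equivalent spellings of the same pattern.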

Alfe · Answer · 2014-01-08 11:03:20Z

1

I'd use this:

import re

words = "hypothesis test results total".split()
# ^^^ but you can use your literal list if you prefer that
for word in words:
  if re.search(r'\b%s\b' % (word,), sentence):
    print word

You can even speed this up by using a single regexp:

for foundWord in re.findall(r'\b' + r'\b|\b'.join(words) + r'\b', sentence):
  print foundWord
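A variation on the single-regexp idea, offered as a sketch rather than a drop-in replacement: group the alternatives so that \b only needs writing once, escape each word, and compile the pattern up front. The build_pattern helper name is invented for this example, and print() is used so the code also runs on Python 3.

```python
import re

def build_pattern(words):
    """Compile one regex that matches any of the given words as a whole word."""
    # (?:...) groups the alternatives without capturing, so findall
    # returns the matched words themselves.
    alternatives = '|'.join(re.escape(word) for word in words)
    return re.compile(r'\b(?:{})\b'.format(alternatives))

words = "hypothesis test results total".split()
pattern = build_pattern(words)

sentence = "The total results for the test confirm the hypothesis"
print(pattern.findall(sentence))

print(pattern.findall("These tests will benefit in the long run."))
```

findall reports matches in the order they appear in the sentence, and a word that occurs twice is reported twice. Wrap the result in set() if only the distinct words matter.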

answered Jan 8, 2014 at 11:03
