1

What is the best way to search for matching words inside a string?

Right now I do something like the following:

if re.search('([h][e][l][l][o])',file_name_tmp, re.IGNORECASE):

This works, but it's slow: I have around 100 different regex statements, each searching for a full word, so I'd like to combine several of them using a | separator or something similar.

2
  • 4
    A single character inside [] is really pointless. Get a decent introduction to regexes; you seem to have trouble with at least one of its most basic parts... Commented Oct 18, 2010 at 19:14
  • 1
    FYI, you've vastly overcomplicated your regex. if re.search('hello', file_name_tmp, re.IGNORECASE) would be exactly the same. Commented Oct 18, 2010 at 19:59

4 Answers

3
>>> import re
>>> words = ('hello', r'good\-bye', 'red', 'blue')  # raw string keeps the regex escape intact
>>> pattern = re.compile('(' + '|'.join(words) + ')', re.IGNORECASE)
>>> sentence = 'SAY HeLLo TO reD, good-bye to Blue.'
>>> print(pattern.findall(sentence))
['HeLLo', 'reD', 'good-bye', 'Blue']
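
Rather than escaping special characters by hand, as in 'good\-bye' above, the standard library's re.escape can do it for each word. A minimal sketch, assuming every word is meant to be matched literally; the helper name build_pattern is made up for illustration:

```python
import re

def build_pattern(words):
    # Escape each word so characters like '-' or '.' are matched literally,
    # then join the alternatives with '|' inside one capturing group.
    alternatives = '|'.join(re.escape(w) for w in words)
    return re.compile('(' + alternatives + ')', re.IGNORECASE)

pattern = build_pattern(('hello', 'good-bye', 'red', 'blue'))
matches = pattern.findall('SAY HeLLo TO reD, good-bye to Blue.')
print(matches)
```

Compiling the combined pattern once and reusing it avoids running around 100 separate searches per string.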

1 Comment

+1 Good answer. However, I think it's also important to point out word-boundary conditions/options available.
3

Can you try:

if 'hello' in longtext:

or

if 'HELLO' in longtext.upper():

to match hello/Hello/HELLO.
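
With many words, the same idea extends to a loop over a word list. A rough sketch, assuming plain substring matches are acceptable (so 'red' would also be found inside 'bored'); the word list and the helper name contains_any are illustrative and not from the question:

```python
def contains_any(text, words):
    # Lowercase the text once, then test each word as a plain substring.
    lowered = text.lower()
    return any(word.lower() in lowered for word in words)

words = ['hello', 'red', 'blue']
print(contains_any('Say HeLLo to everyone', words))
print(contains_any('nothing to see here', words))
```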

2

If you just want to check whether 'hello' (or any other complete word) occurs in a string, you could also do:

if 'hello' in stringToMatch:
    ... # Match found , do something

To find several different strings at once, you could also use findall:

>>> import re
>>> toMatch = 'e3e3e3eeehellloqweweemeeeeefe'
>>> regex = re.compile("hello|me", re.IGNORECASE)
>>> print(regex.findall(toMatch))
['me']
>>> toMatch = 'e3e3e3eeehelloqweweemeeeeefe'
>>> print(regex.findall(toMatch))
['hello', 'me']
>>> toMatch = 'e3e3e3eeeHelLoqweweemeeeeefe'
>>> print(regex.findall(toMatch))
['HelLo', 'me']
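
If the goal is to know which of the search words occurred, regardless of case, the findall results can be normalized into a set. A small sketch under that assumption; the variable names to_match and found are illustrative:

```python
import re

regex = re.compile("hello|me", re.IGNORECASE)
to_match = 'e3e3e3eeeHelLoqweweemeeeeefe'

# Lowercase each hit so 'HelLo' and 'hello' count as the same word.
found = {m.lower() for m in regex.findall(to_match)}
print(sorted(found))
```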

2 Comments

That works; however, I still need the regex functionality of returning a group of matches, as the words in the string are sometimes uppercase and sometimes lowercase.
@Joe: In that case you could use a regex with the | operator. See my edited reply.
2

You say you want to search for WORDS. What is your definition of a "word"? If you are looking for "meet", do you really want to match the "meet" in "meeting"? If not, you might like to try something like this:

>>> import re
>>> query = ("meet", "lot")
>>> text = "I'll meet a lot of friends including Charlotte at the town meeting"
>>> regex = r"\b(" + "|".join(query) + r")\b"
>>> re.findall(regex, text, re.IGNORECASE)
['meet', 'lot']
>>>

The \b at each end forces it to match only at word boundaries, using re's definition of "word" -- "isn't" isn't a word, it's two words separated by an apostrophe. If you don't like that, look at the nltk package.
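
To see what re's definition of a word means in practice, the sketch below first splits "isn't" into \w+ runs and then applies the same \b-wrapped alternation to another sentence. It assumes the query terms are literal text, hence the re.escape call and the non-capturing group, both of which are additions to the code above:

```python
import re

# \w+ treats the apostrophe as a separator, so "isn't" yields two pieces.
pieces = re.findall(r"\w+", "isn't")
print(pieces)

query = ("meet", "lot")
# A non-capturing group keeps findall returning the whole match.
regex = re.compile(
    r"\b(?:" + "|".join(re.escape(q) for q in query) + r")\b",
    re.IGNORECASE,
)
hits = regex.findall("Meet me at the meeting, Charlotte has a LOT to say")
print(hits)
```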
