51

Let's say I have a list of strings:

string_lst = ['fun', 'dum', 'sun', 'gum']

I want to build a regular expression that, at some point in the pattern, matches any of the strings in that list within a group, something like this:

import re
template = re.compile(r".*(elem for elem in string_lst).*")
template.match("I love to have fun.")

What would be the correct way to do this? Or would one have to make multiple regular expressions and match them all separately to the string?

5 Comments
  • Join the array elements with | as glue. This forms the string fun|dum|sun|gum, which can be used in a regex. Commented Oct 29, 2015 at 5:07
  • re.search('|'.join(string_lst), input_string) Commented Oct 29, 2015 at 5:08
  • any(z in string_lst for z in re.findall(r"['\w]+", 'This is just for fun')) Commented Oct 29, 2015 at 5:15
  • Do you care which of the strings is found, or just that any of them are found? Commented Oct 29, 2015 at 5:20
  • The answers are OK but not optimal. Did you mean that you want to automatically derive a regular expression such as r"[fs]un|[dg]u[m]" from the list? That is a very interesting question, and incidentally the basis of fields such as phonology. To answer it I would need to know whether that is what you meant, plus details such as whether the strings can be assumed to have similar length, what trade-offs to set between insertion, deletion, and replacement, and in what sense a regexp counts as minimal. Commented Jan 8, 2020 at 18:58

5 Answers

64

Join the list on the pipe character |, which represents different options in regex.

string_lst = ['fun', 'dum', 'sun', 'gum']
x="I love to have fun."

print(re.findall(r"(?=(" + '|'.join(string_lst) + r"))", x))

Output: ['fun']

You cannot use match here, because it only matches at the start of the string. search returns only the first match, so use findall to get all of them.
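To make the difference concrete, here is a minimal sketch comparing the three calls on the same pattern. The sample sentence is made up for illustration, and re.escape is added as a precaution that the original snippet does not use.

```python
import re

string_lst = ['fun', 'dum', 'sun', 'gum']
# Plain alternation: fun|dum|sun|gum (escaped in case a word has metacharacters)
pattern = re.compile('|'.join(map(re.escape, string_lst)))

text = "Chewing gum in the sun is fun."  # hypothetical sample sentence

at_start = pattern.match(text)     # None: the text does not start with a listed word
first = pattern.search(text)       # match object for the first hit only
all_found = pattern.findall(text)  # every non-overlapping hit, left to right

print(at_start)       # None
print(first.group())  # gum
print(all_found)      # ['gum', 'sun', 'fun']
```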

Also use lookahead if you have overlapping matches not starting at the same point.
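The following sketch shows what the lookahead buys you. The words 'abc' and 'bcd' are hypothetical, chosen only because they overlap inside "abcd".

```python
import re

words = ['abc', 'bcd']  # hypothetical overlapping words
text = "abcd"

alternation = '|'.join(map(re.escape, words))

# Without lookahead, 'abc' consumes the characters that 'bcd' would need.
consuming = re.findall(alternation, text)

# With a capturing group inside a zero-width lookahead, nothing is consumed,
# so a match can start at every position.
overlapping = re.findall(r"(?=(" + alternation + r"))", text)

print(consuming)    # ['abc']
print(overlapping)  # ['abc', 'bcd']
```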

4 Comments

But this will also return ['fun'] if the string contains a word like funny.
Oh nice. re.findall(r"(?=\b("+'|'.join(string_lst)+r")\b)",x) It worked for me
The approach is correct but incomplete: it matches every occurrence of a listed word, even inside a longer word that contains it. For example, try x = "I love to have funny" and check. With word boundaries around the whole alternation it becomes: print(re.findall(r"(?=(\b(?:" + '|'.join(string_lst) + r")\b))", x))
@Pranzell I removed your edit. Please add your own answer below the existing one, stating the condition in which it is better :)
25

The regex module has named lists (actually sets):

#!/usr/bin/env python
import regex as re # $ pip install regex

p = re.compile(r"\L<words>", words=['fun', 'dum', 'sun', 'gum'])
if p.search("I love to have fun."):
    print('matched')

Here words is just a name; you can use any name you like instead.
The .search() method is used instead of adding .* before and after the named list.

To emulate named lists using stdlib's re module:

#!/usr/bin/env python
import re

words = ['fun', 'dum', 'sun', 'gum']
longest_first = sorted(words, key=len, reverse=True)
p = re.compile(r'(?:{})'.format('|'.join(map(re.escape, longest_first))))
if p.search("I love to have fun."):
    print('matched')

re.escape() is used to escape regex meta-characters such as .*? inside individual words (to match the words literally).
sorted() emulates the regex module's behavior by putting the longest words first among the alternatives. Compare:

>>> import re
>>> re.findall("(funny|fun)", "it is funny")
['funny']
>>> re.findall("(fun|funny)", "it is funny")
['fun']
>>> import regex
>>> regex.findall(r"\L<words>", "it is funny", words=['fun', 'funny'])
['funny']
>>> regex.findall(r"\L<words>", "it is funny", words=['funny', 'fun'])
['funny']
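As a rough illustration of why re.escape() matters, the sketch below joins two words that contain metacharacters, with and without escaping. The words and the sample text are made up for this example.

```python
import re

words = ['a.b', '1+1']   # hypothetical words containing regex metacharacters
text = "axb a.b 1+1 11"  # hypothetical sample text

# Unescaped: '.' matches any character and '+' means "one or more".
unescaped = re.compile('|'.join(words))
# Escaped: every word is matched literally.
escaped = re.compile('|'.join(map(re.escape, words)))

loose = unescaped.findall(text)
exact = escaped.findall(text)

print(loose)  # ['axb', 'a.b', '11']
print(exact)  # ['a.b', '1+1']
```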

4 Comments

You could add that it solves the complexity problem of the a|b|c|d ... approach (linear search).
@Jean-FrançoisFabre I'm not sure there has to be a difference: both interfaces could be compiled to the same algorithm that runs in time linear in the input. The actual implementation may differ, so benchmark it if it matters for your input.
p = regex.compile(r"\L<words>", words=['fun', 'dum', 'sun', 'gum']) instead of p = re.compile(r"\L<words>", words=['fun', 'dum', 'sun', 'gum'])
Lbro: look at the import. It hints that the regex module, if you use it, is a drop-in replacement for the stdlib re module (no need to use both modules in the same module).
6

Besides a regular expression, you can use a list comprehension. I hope that's not off topic.

import re
def match(input_string, string_list):
    words = re.findall(r'\w+', input_string)
    return [word for word in words if word in string_list]

>>> string_lst = ['fun', 'dum', 'sun', 'gum']
>>> match("I love to have fun.", string_lst)
['fun']

6

You should make sure to escape the strings correctly before combining them into a regex:

>>> import re
>>> string_lst = ['fun', 'dum', 'sun', 'gum']
>>> x = "I love to have fun."
>>> regex = re.compile("(?=(" + "|".join(map(re.escape, string_lst)) + "))")
>>> re.findall(regex, x)
['fun']

1 Comment

Is there a way to use re.search instead of re.findall here? I tried using re.search and got this unexpected output: <re.Match object; span=(15, 15), match=''>
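On the re.search question above: the lookahead itself is zero-width, which is why the overall match is the empty string at span (15, 15). The word is still captured in group 1. A minimal sketch, reusing the pattern from this answer:

```python
import re

string_lst = ['fun', 'dum', 'sun', 'gum']
x = "I love to have fun."

lookahead = re.compile("(?=(" + "|".join(map(re.escape, string_lst)) + "))")

m = lookahead.search(x)
whole = m.group(0)  # the zero-width lookahead match itself
word = m.group(1)   # the word captured inside the lookahead
where = m.span(1)   # where that word sits in x

print(repr(whole))  # ''
print(word)         # fun
print(where)        # (15, 18)
```

If overlapping matches are not needed, dropping the lookahead and calling search on the plain alternation returns the word as the match itself.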
5

Building on @vks's answer, I think this does the complete task:

finds = re.findall(r"(?=(\b" + '\\b|\\b'.join(string_lst) + r"\b))", x)

Adding word boundaries around each word completes the task!
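A short sketch of the effect, assuming overlapping matches are not needed (so the lookahead is left out) and using a non-capturing group so that a single pair of \b applies to every alternative. The test sentences are made up.

```python
import re

string_lst = ['fun', 'dum', 'sun', 'gum']

# \b(?:fun|dum|sun|gum)\b : whole words only
whole_word = re.compile(r"\b(?:" + "|".join(map(re.escape, string_lst)) + r")\b")

standalone = whole_word.findall("I love to have fun.")
embedded = whole_word.findall("That was funny.")

print(standalone)  # ['fun']
print(embedded)    # []
```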
