How to match any string from a list of strings in regular expressions in Python?

Question

Let's say I have a list of strings:

string_lst = ['fun', 'dum', 'sun', 'gum']

I want to make a regular expression where, at some point in it, I can match any of the strings I have in that list within a group, such as this:

import re
template = re.compile(r".*(elem for elem in string_lst).*")
template.match("I love to have fun.")

What would be the correct way to do this? Or would one have to make multiple regular expressions and match them all separately to the string?

Join the list elements with | as glue. This forms the string fun|dum|sun|gum, which can be used in a regex.
– Tushar, Commented Oct 29, 2015 at 5:07
any(z in string_list for z in re.findall(r"['\w]+", 'This is just for fun'))
– Burhan Khalid, Commented Oct 29, 2015 at 5:15
Do you care which of the strings is found, or just that any of them are found?
– Burhan Khalid, Commented Oct 29, 2015 at 5:20
The answers are OK but not optimal. Did you mean that you want to automatically find a regular expression such as r"[fs]un|[dg]u[m]"? This is a very interesting question, which is, by the way, the basis for fields such as phonology. To answer it I would need to know whether you can assume similar lengths, what tradeoffs you set between insertion, deletion and replacement, and in what terms a regexp counts as minimal.
– Veltzer Doron, Commented Jan 8, 2020 at 18:58

wjandrea · Accepted Answer · 2021-01-02 15:45:19Z

64

Join the list on the pipe character |, which separates alternatives in a regex.

import re

string_lst = ['fun', 'dum', 'sun', 'gum']
x = "I love to have fun."

print(re.findall(r"(?=(" + '|'.join(string_lst) + r"))", x))

Output: ['fun']

You cannot use match here, because it only matches at the start of the string. search returns only the first match, so use findall to get all of them.
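As a rough illustration of the difference between the three calls, here is a minimal sketch. The sample sentence and the variable names are assumptions made up for this example, and re.escape is added so that each word is treated literally.

```python
import re

string_lst = ['fun', 'dum', 'sun', 'gum']
text = "The sun is fun."  # sample sentence chosen for illustration

# Escape each word so it is matched literally, then join with "|".
pattern = re.compile("|".join(map(re.escape, string_lst)))

anchored = pattern.match(text)   # only tries position 0 of the text
first = pattern.search(text)     # first hit anywhere in the text
every = pattern.findall(text)    # all non-overlapping hits

print(anchored)        # None
print(first.group())   # sun
print(every)           # ['sun', 'fun']
```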

Also, use a lookahead if matches may overlap without starting at the same position.
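To see what the lookahead buys you, consider this small sketch. The word list and the input "funder" are hypothetical, picked only because "fun" and "under" overlap inside it.

```python
import re

words = ['fun', 'under']   # hypothetical list with overlapping hits
text = "funder"

alternation = "|".join(map(re.escape, words))

# A plain alternation consumes "fun", so the search resumes at "der".
plain_hits = re.findall(alternation, text)

# A lookahead consumes nothing, so every start position is tried;
# the capturing group inside it records what would have matched.
overlapping_hits = re.findall("(?=(" + alternation + "))", text)

print(plain_hits)        # ['fun']
print(overlapping_hits)  # ['fun', 'under']
```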

edited Jan 2, 2021 at 15:45

wjandrea


answered Oct 29, 2015 at 5:12

vks


4 Comments

Marlon Abeykoon Over a year ago

But this will return ['fun'] if there is a word like funny

Marlon Abeykoon Over a year ago

Oh nice. re.findall(r"(?=\b("+'|'.join(string_lst)+r")\b)",x) It worked for me

Pranzell Over a year ago

The approach is correct but incomplete. It matches every occurrence of a list word in the given string, including occurrences inside longer words. For example, try x = "I love to have funny" and check. The proper raw format would be: print(re.findall(r"(?=(\b" + '\\b|\\b'.join(string_lst) + r"\b))", x))

vks Over a year ago

@Pranzell I removed your edit. Please add your own answer below the existing one, stating the condition in which it is better :)

jfs · Accepted Answer · 2015-10-29 10:19:04Z

25

The regex module has named lists (sets, actually):

#!/usr/bin/env python
import regex as re # $ pip install regex

p = re.compile(r"\L<words>", words=['fun', 'dum', 'sun', 'gum'])
if p.search("I love to have fun."):
    print('matched')

Here words is just a name; you can use anything you like instead.
The .search() method is used instead of putting .* before and after the named list.

To emulate named lists using stdlib's re module:

#!/usr/bin/env python
import re

words = ['fun', 'dum', 'sun', 'gum']
longest_first = sorted(words, key=len, reverse=True)
p = re.compile(r'(?:{})'.format('|'.join(map(re.escape, longest_first))))
if p.search("I love to have fun."):
    print('matched')

re.escape() is used to escape regex metacharacters such as .*? inside individual words, so that the words are matched literally.
sorted() emulates the regex module's behavior by putting the longest words first among the alternatives. Compare:

>>> import re
>>> re.findall("(funny|fun)", "it is funny")
['funny']
>>> re.findall("(fun|funny)", "it is funny")
['fun']
>>> import regex
>>> regex.findall(r"\L<words>", "it is funny", words=['fun', 'funny'])
['funny']
>>> regex.findall(r"\L<words>", "it is funny", words=['funny', 'fun'])
['funny']

answered Oct 29, 2015 at 10:19

jfs


4 Comments

Jean-François Fabre Over a year ago

you could add that it solves the complexity problem of the a|b|c|d ... approach (linear search)

jfs Over a year ago

@Jean-FrançoisFabre I'm not sure there has to be a difference: both interfaces could be compiled to the same algorithm, one that runs in time linear in the input. The actual implementations may differ, so benchmark it if it matters for your input.

Lbro Dec 2, 2024 at 16:31

p = regex.compile(r"\L<words>", words=['fun', 'dum', 'sun', 'gum']) instead of p = re.compile(r"\L<words>", words=['fun', 'dum', 'sun', 'gum'])

jfs Dec 3, 2024 at 17:37

Lbro: look at the import. It hints that if you use the regex module, it is a drop-in replacement for the stdlib re module (no need to use both modules in the same module).

lord63. j · Accepted Answer · 2015-10-29 05:21:44Z

6

Besides building one regular expression, you can use a list comprehension. I hope that's not off topic.

import re
def match(input_string, string_list):
    words = re.findall(r'\w+', input_string)
    return [word for word in words if word in string_list]

>>> string_lst = ['fun', 'dum', 'sun', 'gum']
>>> match("I love to have fun.", string_lst)
['fun']
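A possible variation, sketched under the assumption that the list may be long: convert it to a set once so that each membership test is a hash lookup. The name match_words is made up for this example. Because the text is split into whole \w+ tokens first, "funny" does not count as a hit for "fun".

```python
import re

def match_words(input_string, string_list):
    """Return the tokens of input_string that appear in string_list."""
    wanted = set(string_list)  # set lookup instead of scanning the list
    tokens = re.findall(r"\w+", input_string)
    return [token for token in tokens if token in wanted]

string_lst = ['fun', 'dum', 'sun', 'gum']
print(match_words("I love to have fun.", string_lst))  # ['fun']
print(match_words("That was funny.", string_lst))      # []
```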

answered Oct 29, 2015 at 5:21

lord63. j


Comments

John La Rooy · Accepted Answer · 2015-10-29 06:02:51Z

6

You should make sure to escape the strings correctly before combining them into a regex:

>>> import re
>>> string_lst = ['fun', 'dum', 'sun', 'gum']
>>> x = "I love to have fun."
>>> regex = re.compile("(?=(" + "|".join(map(re.escape, string_lst)) + "))")
>>> re.findall(regex, x)
['fun']
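The four words in the question contain no special characters, so escaping changes nothing for them. The sketch below uses a hypothetical list where it does matter: "." and "$" are regex metacharacters, so the unescaped pattern matches the wrong text and misses the right one.

```python
import re

words = ['a.b', '$5']      # hypothetical words containing metacharacters
text = "axb costs $5"

unescaped = "|".join(words)                 # a.b|$5
escaped = "|".join(map(re.escape, words))   # a\.b|\$5

# Unescaped: "." matches any character, and "$" anchors to the end
# of the string, so "$5" can never match.
print(re.findall(unescaped, text))  # ['axb']

# Escaped: both words are matched literally.
print(re.findall(escaped, text))    # ['$5']
```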

answered Oct 29, 2015 at 6:02

John La Rooy


1 Comment

ZZZ Over a year ago

Is there a way to use re.search instead of re.findall here? I tried using re.search, and I got this bad output: <re.Match object; span=(15, 15), match=''>
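On the re.search question above: a lookahead consumes no characters, so the overall match is empty, but the capturing group inside it still holds the word. A minimal sketch of both ways to get the text back (the variable names are assumptions for illustration):

```python
import re

string_lst = ['fun', 'dum', 'sun', 'gum']
x = "I love to have fun."
alternation = "|".join(map(re.escape, string_lst))

# Lookahead version: group 0 is empty, group 1 holds the word.
m = re.search("(?=(" + alternation + "))", x)
print(repr(m.group(0)))  # ''
print(m.group(1))        # fun
print(m.span(1))         # (15, 18)

# If overlapping matches are not needed, drop the lookahead and the
# match object covers the word directly.
m_plain = re.search(alternation, x)
print(m_plain.group())   # fun
```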

evandrix · Accepted Answer · 2021-12-12 03:47:35Z

5

In line with @vks's reply, I feel this actually does the complete task:

finds = re.findall(r"(?=(\b" + '\\b|\\b'.join(string_lst) + r"\b))", x)

Adding word boundary completes the task!
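An equivalent way to write this, sketched here as an alternative rather than as the answer's own method, is to wrap the whole alternation in a non-capturing group so that a single pair of \b applies to every word. Without the group, \b binds only to the first and last alternatives, which the second pattern below demonstrates.

```python
import re

string_lst = ['fun', 'dum', 'sun', 'gum']
alternation = "|".join(map(re.escape, string_lst))

# \b(?:fun|dum|sun|gum)\b : the boundaries apply to every alternative.
whole_words = re.compile(r"\b(?:" + alternation + r")\b")

print(whole_words.findall("I love to have fun."))  # ['fun']
print(whole_words.findall("That was funny."))      # []
print(whole_words.findall("sun, fun and gum"))     # ['sun', 'fun', 'gum']

# Pitfall: \bfun|dum|sun|gum\b means (\bfun) or dum or sun or (gum\b),
# so "dum" still matches inside "dummy".
ungrouped = re.compile(r"\b" + alternation + r"\b")
print(ungrouped.findall("dummy"))                  # ['dum']
```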

edited Dec 12, 2021 at 3:47

evandrix


answered Apr 22, 2020 at 13:22

Pranzell
