Python regex matching multiple words from a list

Question

I have a list of words and a string, and I would like to get back the words from the original list that are found in the string.

Example:

import re

lof_terms = ['car', 'car manufacturer', 'popular']
str_content = 'This is a very popular car manufacturer.'

pattern = re.compile(r"(?=(\b" + r"\b|".join(map(re.escape, lof_terms)) + r"\b))")
found_terms = re.findall(pattern, str_content)

This returns only ['popular', 'car']; it fails to catch 'car manufacturer'. However, it does catch it if I change the source list of terms to lof_terms = ['car manufacturer', 'popular'].

Somehow the overlap between 'car' and 'car manufacturer' seems to be the source of this issue.

Any ideas on how to get around this?

Many thanks

Is regex a must?

– Mr. Hobo, commented Dec 14, 2020 at 14:10

Wiktor Stribiżew · Accepted Answer · 2020-12-14 14:12:13Z

2

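The current code can be fixed if you first sort lof_terms by length in descending order. Alternatives in a regex are tried from left to right and the first one that matches wins, so 'car' is matched before 'car manufacturer' is ever tried; putting the longer terms first avoids that: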

rx = r"(?=\b({})\b)".format("|".join(map(re.escape, sorted(lof_terms, key=len, reverse=True))))
pattern = re.compile(rx)

Note that in this case the \b word boundaries are used only once, on either end of the group; there is no need to repeat them around each alternative.
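The effect of the order of alternatives can be seen in isolation with a minimal sketch. The two hard-coded patterns below are illustrative only (they are not part of the original code) and differ only in which term comes first:

```python
import re

text = 'This is a very popular car manufacturer.'

# Shorter term first: at the position of "car", the "car" alternative
# succeeds, so the longer alternative is never tried there.
short_first = re.findall(r"(?=\b(car|car manufacturer)\b)", text)

# Longer term first: "car manufacturer" is tried before "car".
long_first = re.findall(r"(?=\b(car manufacturer|car)\b)", text)

print(short_first)  # ['car']
print(long_first)   # ['car manufacturer']
```

Sorting by length in descending order is a simple way to guarantee that a term is always tried before any of its own prefixes.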

See the Python demo:

import re

lof_terms = ['car', 'car manufacturer', 'popular']
str_content = 'This is a very popular car manufacturer.'

rx = r"(?=\b({})\b)".format("|".join(map(re.escape, sorted(lof_terms, key=len, reverse=True))))
pattern = re.compile(rx)
found_terms = re.findall(pattern, str_content)
print(found_terms)
# => ['popular', 'car manufacturer']
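Note that the result above does not include 'car', because only one alternative can match at a given position. If every term from the list that occurs in the string should be reported, including terms contained in longer ones, one possible approach is to search for each term separately. This is a sketch under that assumption about the desired output; terms_found is a hypothetical helper name, not something from the original code:

```python
import re

lof_terms = ['car', 'car manufacturer', 'popular']
str_content = 'This is a very popular car manufacturer.'

def terms_found(terms, text):
    """Return the terms (in list order) that occur in text as whole words."""
    return [
        term for term in terms
        if re.search(r"\b{}\b".format(re.escape(term)), text)
    ]

all_found = terms_found(lof_terms, str_content)
print(all_found)  # ['car', 'car manufacturer', 'popular']
```

This runs one search per term instead of a single combined pattern, so it trades some speed on long term lists for results that do not depend on the order of the terms.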

1 Comment

davep Over a year ago

I've implemented your solution and it does indeed do the job. Thank you very much for your help.
