I have a list of words and a string and would like to get back a list of words from the original list which are found in the string.

Ex:

import re

lof_terms = ['car', 'car manufacturer', 'popular']
str_content = 'This is a very popular car manufacturer.'

pattern = re.compile(r"(?=(\b" + r"\b|".join(map(re.escape, lof_terms)) + r"\b))")
found_terms = re.findall(pattern, str_content)

This returns only ['popular', 'car']; it fails to catch 'car manufacturer'. However, it does catch it if I change the source list of terms to lof_terms = ['car manufacturer', 'popular'].

The overlap between 'car' and 'car manufacturer' seems to be the source of the issue.

Any ideas on how to get around this?

Many thanks

  • Is a regex a must? Commented Dec 14, 2020 at 14:10

1 Answer

The current code can be fixed by first sorting lof_terms by length in descending order, so that longer terms are tried before the shorter terms they start with:

rx = r"(?=\b({})\b)".format("|".join(map(re.escape, sorted(lof_terms, key=len, reverse=True))))
pattern = re.compile(rx)

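Note that in this case the \b word boundaries are used only once, on either side of the group; there is no need to repeat them around each alternative. (In the original pattern, only the first alternative had a leading \b, because join inserts the separator only between items.) See this regex demo.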
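To see why the order of the alternatives matters, here is a minimal sketch with just the two overlapping terms hard-coded (the variable names short_first and long_first are only for illustration). In Python's re, alternation is ordered: at a given position the first alternative that lets the match succeed wins, and the later ones are not tried.

```python
import re

text = 'This is a very popular car manufacturer.'

# Shorter term listed first: 'car' matches at that position, so the
# longer alternative is never tried there.
short_first = re.findall(r"(?=\b(car|car manufacturer)\b)", text)

# Longer term listed first: it gets the first chance to match.
long_first = re.findall(r"(?=\b(car manufacturer|car)\b)", text)

print(short_first)  # ['car']
print(long_first)   # ['car manufacturer']
```

Sorting by length in descending order is simply a convenient way to guarantee that a term always comes before any shorter term it begins with.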

See the Python demo:

import re

lof_terms = ['car', 'car manufacturer', 'popular']
str_content = 'This is a very popular car manufacturer.'

# Sort longest first so 'car manufacturer' is tried before 'car'.
terms_by_length = sorted(lof_terms, key=len, reverse=True)
rx = r"(?=\b({})\b)".format("|".join(map(re.escape, terms_by_length)))
pattern = re.compile(rx)
found_terms = re.findall(pattern, str_content)
print(found_terms)
# => ['popular', 'car manufacturer']
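
Note that 'car' itself is not in that result, because only one alternative can match at a given position. If you need every term from the list that occurs in the string, including a term contained in a longer one, a possible alternative is to search for each term separately. This is only a sketch: find_terms is a hypothetical helper name, and the \b boundaries assume every term starts and ends with a word character.

```python
import re

def find_terms(terms, text):
    """Return every term from `terms` that occurs in `text` as a whole word or phrase."""
    found = []
    for term in terms:
        # Each term is searched on its own, so overlapping terms
        # cannot hide one another.
        if re.search(r"\b{}\b".format(re.escape(term)), text):
            found.append(term)
    return found

lof_terms = ['car', 'car manufacturer', 'popular']
str_content = 'This is a very popular car manufacturer.'

all_found = find_terms(lof_terms, str_content)
print(all_found)
# => ['car', 'car manufacturer', 'popular']
```

The result follows the order of lof_terms rather than the order of appearance in the string, and the string is scanned once per term, which is fine for short lists but slower than a single combined pattern for very long ones.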

1 Comment

I've implemented your solution and it does indeed do the job. Thank you very much for your help.
