I'm trying to use regex to compare two lists whose strings are similar but not identical. I want to keep only the items of list2 that match none of the words in list1. How can I fix the script below?

Script:

import re

list1 = [
'juice',
'potato']

list2 = [
'juice;44',
'potato;55',
'apple;66']

correlation = []
for a in list1:
    r = re.compile(r'\b{}\b'.format(a), re.I)
    for b in list2:
        if r.search(b):
            pass
        else:
            correlation.append(b)

print(correlation)

Output:

['potato;55', 'apple;66', 'juice;44', 'apple;66']

Desired Output:

['apple;66']

  • You search for each item of list1 in each item of list2, so if, e.g., 'juice' isn't in 'potato;55', then 'potato;55' is added to correlation. Commented Sep 11, 2020 at 0:23
  • How do you recommend doing it? Commented Sep 11, 2020 at 0:25
  • Set a flag to False before the inner for-loop, set it to True in the inner loop if you find a match, and after the loop add the item to correlation if the flag is still False. Commented Sep 11, 2020 at 0:26
  • I understand the logic, but could you please provide a snippet of code as an example? Commented Sep 11, 2020 at 0:28
  • The fastest gun in the West won again. You may want to look at the other, arguably better answers. Commented Sep 11, 2020 at 0:57

3 Answers

You can create a single regex pattern to match terms from list1 as whole words, and then use filter:

import re

list1 = ['juice', 'potato']
list2 = ['juice;44', 'potato;55', 'apple;66']

rx = re.compile(r'\b(?:{})\b'.format("|".join(list1)))
print( list(filter(lambda x: not rx.search(x), list2)) )
# => ['apple;66']

The regex is \b(?:juice|potato)\b. The \b is a word boundary, so the regex matches juice or potato only as whole words. filter(lambda x: not rx.search(x), list2) drops every item of list2 that matches the regex and keeps the rest.
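To see what "whole words" means in practice, here is a minimal sketch that reuses the same pattern on a slightly longer list. The items 'juicer;10' and 'Juice;77' are hypothetical additions made up for illustration, and the list comprehension is just an equivalent spelling of the filter call.

```python
import re

list1 = ['juice', 'potato']
# 'juicer;10' and 'Juice;77' are hypothetical items added for illustration
list2 = ['juice;44', 'potato;55', 'apple;66', 'juicer;10', 'Juice;77']

# Same pattern as in the answer: \b(?:juice|potato)\b
rx = re.compile(r'\b(?:{})\b'.format("|".join(list1)))

# 'juicer' is kept: there is no word boundary between 'juice' and 'r'
# 'Juice' is kept: the pattern is case-sensitive
kept = [x for x in list2 if not rx.search(x)]
print(kept)  # ['apple;66', 'juicer;10', 'Juice;77']

# With re.I, as in the question's script, 'Juice;77' is filtered out too
rx_ci = re.compile(r'\b(?:{})\b'.format("|".join(list1)), re.I)
kept_ci = [x for x in list2 if not rx_ci.search(x)]
print(kept_ci)  # ['apple;66', 'juicer;10']
```

Whether to pass re.I depends on the data; the question's script uses it, so it is probably wanted here.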

First, the inner and outer for-loops must be swapped to make this work, so that each item of list2 is checked once against all the words in list1.

Then you can set a flag to False before the inner for-loop, set it to True in the inner loop if you find a match, and after the loop add the item to correlation if the flag is still False.

The final script looks like this:

import re

list1 = [
'juice',
'potato']

list2 = [
'juice;44',
'potato;55',
'apple;66']

correlation = []
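for b in list2:
    found = False  # becomes True once any word from list1 matches b

    for a in list1:
        r = re.compile(r'\b{}\b'.format(a), re.I)
        if r.search(b):
            found = True
            break  # one match is enough; skip the remaining words

    # keep b only if none of the words matched
    if not found:
        correlation.append(b)

print(correlation)  # ['apple;66']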
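As a possible variation on the flag approach, the built-in any() can play the role of the found variable. This is only a sketch of the same logic, not a different algorithm; the name patterns is introduced here for illustration.

```python
import re

list1 = ['juice', 'potato']
list2 = ['juice;44', 'potato;55', 'apple;66']

# Compile each pattern once, outside the loop over list2
patterns = [re.compile(r'\b{}\b'.format(a), re.I) for a in list1]

# any() stops at the first pattern that matches, like setting found = True
correlation = [b for b in list2 if not any(p.search(b) for p in patterns)]

print(correlation)  # ['apple;66']
```

Compiling the patterns up front avoids rebuilding the same regex for every item of list2.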

Convert list1 into a single regexp that matches all the words. Then keep each element of list2 that doesn't match the regexp.

import re

regex = re.compile(r'\b(?:' + '|'.join(re.escape(word) for word in list1) + r')\b')
correlation = [a for a in list2 if not regex.search(a)]

3 Comments

If you use re.escape, you cannot rely on \b as word boundaries. In this case, you should use (?<!\w) and (?!\w) as word boundaries since you assume that "words" may start/end with special characters.
Good point. I just use re.escape() habitually whenever the input data is supposed to be matched literally, to prevent errors from embedded punctuation.
If the input words begin or end with punctuation, I'm not even sure what the expected result is supposed to be.
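The point about \b raised in these comments can be checked with a small sketch. The word 'c++' and the item 'c++;10' are hypothetical data chosen because they end with punctuation; nothing in the question suggests the real lists contain such words.

```python
import re

# Hypothetical data: 'c++' ends with punctuation, unlike the question's words
words = ['c++', 'juice']
items = ['c++;10', 'juice;44', 'apple;66']

alternation = '|'.join(re.escape(w) for w in words)

# \b after '+' requires a word character next, so 'c++;10' is not matched
with_b = re.compile(r'\b(?:' + alternation + r')\b')
# Lookarounds only require that no word character is adjacent
with_lookaround = re.compile(r'(?<!\w)(?:' + alternation + r')(?!\w)')

kept_b = [i for i in items if not with_b.search(i)]
kept_lookaround = [i for i in items if not with_lookaround.search(i)]

print(kept_b)           # ['c++;10', 'apple;66']
print(kept_lookaround)  # ['apple;66']
```

For words made only of letters, digits, and underscores, both patterns behave the same, so the choice only matters when the search terms can start or end with punctuation.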
