I have a loop that picks country names one by one from a list. For the current iteration, say x_3 = 'United Kingdom'. I want to search for x_3 in a text, txt_to_srch, keeping in mind that 'United Kingdom' may appear as 'United  Kingdom' (more than one space) or as '\nUnited Kingdom\n' in the text. The phrase 'United Kingdom' is present in txt_to_srch.

I have used the following code:

x_3 = '\s+'.join(x_3.split(" "))
x_3 = r"\b" + re.escape(x_3)+r"\b"
x2 = re.compile(x_3,re.IGNORECASE)
txt_to_srch = re.sub(r'\n',' ',txt_to_srch)
txt_to_srch = re.sub(r'\r',' ',txt_to_srch)
txt_to_srch = re.sub(r'\t',' ',txt_to_srch)
y = re.findall(x2,txt_to_srch)

However, y comes back as an empty list.

1 Answer

Don't use re.escape here; it adds unwanted backslashes. From the documentation:

re.escape(pattern)

Escape special characters in pattern. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

Using re.escape on your first regex turns it into United\\s\+Kingdom, which tries to match a literal \ followed by an s and a literal + between United and Kingdom, instead of one or more whitespace characters.
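
To see the effect concretely, here is a small sketch comparing the escaped and unescaped patterns. The names joined, escaped and sample are illustrative and do not come from the question:

```python
import re

# Pattern as produced by the question's first line: words joined by \s+
joined = r"United\s+Kingdom"

# re.escape treats the backslash and the plus sign as literals to protect
escaped = re.escape(joined)
print(escaped)  # United\\s\+Kingdom

sample = "United   Kingdom"
print(re.search(escaped, sample))         # None
print(re.search(joined, sample).group())  # United   Kingdom
```

The escaped pattern would only match text that literally contains the characters \s+ between the two words.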

Without it, your code works as expected:

import re

x_3 = "United Kingdom"

txt_to_srch = """Monty Pythons come from United Kingdom. They do.
United Kingdom is their home. Yes.
United Kingdom"""

x_3 = r'\s+'.join(x_3.split(" "))
x_3 = r"\b" + x_3 + r"\b"
# print(x_3)
# \bUnited\s+Kingdom\b
x2 = re.compile(x_3, re.IGNORECASE)
txt_to_srch = re.sub(r'\n',' ',txt_to_srch)
txt_to_srch = re.sub(r'\r',' ',txt_to_srch)
txt_to_srch = re.sub(r'\t',' ',txt_to_srch)
y = re.findall(x2,txt_to_srch)

print(y)
# ['United Kingdom', 'United Kingdom', 'United Kingdom']
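
If some country names in the list might contain regex metacharacters (a dot, for example), one possible variation is to escape each word separately and only then join the words with \s+. This is a sketch under that assumption rather than part of the original code; build_country_pattern is a hypothetical helper name. Because \s also matches newlines, tabs and carriage returns, the re.sub replacements are not needed in this version:

```python
import re

def build_country_pattern(name):
    """Hypothetical helper: compile a whitespace-tolerant pattern for a name."""
    # Escape each word on its own so the \s+ separators stay unescaped
    escaped_words = [re.escape(word) for word in name.split()]
    # \b assumes the name starts and ends with a word character
    return re.compile(r"\b" + r"\s+".join(escaped_words) + r"\b", re.IGNORECASE)

text = "Monty Python comes from United   Kingdom.\nunited\nkingdom is their home."
pattern = build_country_pattern("United Kingdom")
matches = pattern.findall(text)
print(matches)
# ['United   Kingdom', 'united\nkingdom']
```

Note that findall returns the text exactly as it appears, including the extra spaces or the newline between the words.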