Searching List of Strings Using Regex to Find Substrings Python

Question

I have gone through many of the regex questions on here and used the advice in them, but I still can't get my code to work. I have a list of strings, and I am trying to find the entries in this list that contain one of the following patterns:

a BLANK of a BLANK
an BLANK of an BLANK
a BLANK of an BLANK
an BLANK of a BLANK
that BLANK of a BLANK
that BLANK of an BLANK
the BLANK of a BLANK
the BLANK of an BLANK

For example, I should be able to find sentences that contain phrases like "an idiot of a doctor" or "the hard-worker of a student."

Once found, I want to make a list of the sentences that satisfy these criteria. So far, this is my code:

for sentence in sentences:
    matched = re.search(r"a [.*]of a " \
                        r"an [.*]of an " \
                        r"a [.*]of an" \
                        r"an [.*]of a " \
                        r"that [.*]of a " \
                        r"that [.*]of an " \
                        r"the [.*]of a " \
                        r"the [.*]of an ", sentence)
    if matched:
        bnp.append(matched)

#Below two lines for testing purposes only
print(matched)
print(bnp)

This code returns no results, even though the list contains phrases that should satisfy the criteria.

Why do you write this kind of thing: [.*]? Take the time to read a regex tutorial first; don't try random things.
– Casimir et Hippolyte, Commented Jan 17, 2017 at 20:14
I thought that [.*] would let me search for a substring of any length with any characters. Did I misunderstand this?
– K. Swan, Commented Jan 17, 2017 at 20:17
Brackets are used to match single characters; use (.*) instead.
– Navidad20, Commented Jan 17, 2017 at 20:31

Iron Fist · Accepted Answer · 2017-01-17 20:46:44Z

[.*] is a character class, so you are asking the regex to match a single literal dot or star character. Quoting from re's docs:

[]

Used to indicate a set of characters. In a set:

Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.

...
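To make the quoted rule concrete, here is a minimal sketch comparing the bracketed form with the plain wildcard. The sample sentence and variable names are made up for illustration; only `re.compile` and `search` from the standard library are used.

```python
import re

# Inside brackets, "." and "*" lose their special meaning: the class
# matches exactly one character that is either a dot or a star.
char_class = re.compile(r"a [.*]of a ")
wildcard = re.compile(r"a .*of a ")

sentence = "He is a giant of a man."  # hypothetical sample sentence

# No literal "." or "*" follows "a ", so the character-class version finds nothing.
class_match = char_class.search(sentence)

# Outside brackets, ".*" means "any run of characters", so this one matches.
wildcard_match = wildcard.search(sentence)

# The character-class version only matches text like this.
literal_match = char_class.search("a *of a ")

print(class_match)             # None
print(wildcard_match.group())  # a giant of a 
print(literal_match.group())   # a *of a 
```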

So, here is one way to do it:

(th(at|e)|a[n]?)\b.*\b(a[n]?)\b.*

This expression tries to match the, that, a, or an, then any characters up to a standalone a or an, and then the rest of the line.

Here in this link, there is a demonstration of its process.

And here is the actual demonstration:

>>> import re
>>>
>>> regex = r"(th(at|e)|a[n]?)\b.*\b(a[n]?)\b.*"
>>> test_str = ("an idiot of a doctor\n"
...     "the hard-worker of a student.\n"
...     "an BLANK of an BLANK\n"
...     "a BLANK of an BLANK\n"
...     "an BLANK of a BLANK\n"
...     "that BLANK of a BLANK\n"
...     "the BLANK of a BLANK\n"
...     "the BLANK of an BLANK\n")
>>>
>>> matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)
>>>
>>> for m in matches:
...     print(m.group())
...


an idiot of a doctor
the hard-worker of a student.
an BLANK of an BLANK
a BLANK of an BLANK
an BLANK of a BLANK
that BLANK of a BLANK
the BLANK of a BLANK
the BLANK of an BLANK
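One thing worth noting: the pattern above never mentions "of", so it also accepts lines that merely contain two articles. The sketch below shows this, alongside a stricter variant. The `strict` pattern is my own assumption about what "a BLANK of a BLANK" should mean, not part of the original answer, and the sample strings are made up.

```python
import re

# The pattern from the demonstration above, unchanged.
loose = re.compile(r"(th(at|e)|a[n]?)\b.*\b(a[n]?)\b.*", re.IGNORECASE)

# Hypothetical stricter variant: a determiner, at least one word,
# a literal "of", then "a"/"an" followed by another word.
strict = re.compile(r"\b(?:an?|that|the)\s+\S.*?\s+of\s+an?\s+\S", re.IGNORECASE)

samples = [
    "an idiot of a doctor",
    "a cat and a dog",  # contains two articles but no "of"
]

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # both samples
print(strict_hits)  # only the first sample
```

Whether the stricter form is the right one depends on the data; it is only meant to show where the "of" requirement would go.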

Chris · Accepted Answer · 2017-01-17 20:38:28Z

As it stands, Python will concatenate your adjacent pattern strings into one long string with no operators between them. So in effect you are searching for the regex "a [.*]of a an [.*]of an a [.*]of an ..."
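A quick way to see the concatenation, as a small sketch (variable names are illustrative): adjacent string literals are joined with nothing in between, whereas joining the alternatives with `|` gives the regex an explicit "or".

```python
import re

# Adjacent string literals are merged into a single string,
# exactly as in the question's re.search call.
concatenated = (r"a [.*]of a "
                r"an [.*]of an ")
print(repr(concatenated))  # 'a [.*]of a an [.*]of an '

# Joining the alternatives with "|" instead (and using .* outside brackets).
alternatives = "|".join([r"a .*of a ", r"an .*of an "])
print(repr(alternatives))  # 'a .*of a |an .*of an '

found = re.search(alternatives, "an idiot of an actor")  # made-up sample text
print(found is not None)   # True
```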

You are missing the 'or' operator: |. A simpler regex to accomplish this task would be something like:

(a|an|that|the) \b.*\b of (a|an) \b.*\b
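For the original goal of building a list of matching sentences, a minimal sketch using this pattern could look like the following. The `sentences` data is hypothetical, and `re.IGNORECASE` is an assumption added so that a sentence-initial "The" or "An" is also found. Note that it appends the sentence itself rather than the match object.

```python
import re

# Pattern from the answer above, compiled once.
# IGNORECASE is an added assumption, not part of the original answer.
pattern = re.compile(r"(a|an|that|the) \b.*\b of (a|an) \b.*\b", re.IGNORECASE)

# Hypothetical sample data standing in for the asker's `sentences` list.
sentences = [
    "She is an idiot of a doctor.",
    "He was the hard-worker of a student.",
    "The weather is nice today.",
]

# Keep the whole sentence (not the match object) when the pattern is found.
bnp = [s for s in sentences if pattern.search(s)]

for s in bnp:
    print(s)
```

The pattern has no leading `\b`, so a word that merely ends in "a" can start a match; prefixing `\b` would be one possible tightening if that turns out to matter for your data.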
