I have gone through many of the regex questions on here and applied the advice in them, but I still can't get my code to work. I have a list of strings, and I am trying to find the entries in this list that contain one of the following patterns:

  • a BLANK of a BLANK
  • an BLANK of an BLANK
  • a BLANK of an BLANK
  • an BLANK of a BLANK
  • that BLANK of a BLANK
  • that BLANK of an BLANK
  • the BLANK of a BLANK
  • the BLANK of an BLANK

For example, I should be able to find sentences that contain phrases like "an idiot of a doctor" or "the hard-worker of a student."

Once found, I want to make a list of the sentences that satisfy these criteria. So far, this is my code:

for sentence in sentences:
    matched = re.search(r"a [.*]of a " \
                        r"an [.*]of an " \
                        r"a [.*]of an" \
                        r"an [.*]of a " \
                        r"that [.*]of a " \
                        r"that [.*]of an " \
                        r"the [.*]of a " \
                        r"the [.*]of an ", sentence)
    if matched:
        bnp.append(matched)

#Below two lines for testing purposes only
print(matched)
print(bnp)

This code returns no results, even though the list contains phrases that should satisfy the criteria.

3
  • Why are you writing things like [.*]? Take the time to read a regex tutorial first; don't try random things. Commented Jan 17, 2017 at 20:14
  • I thought that [.*] would let me search for a substring of any length with any characters. Did I misunderstand this? Commented Jan 17, 2017 at 20:17
  • Brackets are used to match single characters; use (.*) instead. Commented Jan 17, 2017 at 20:31

2 Answers

1

[.*] is a character class, so you are asking the regex to match a literal dot or star character. Quoting from re's docs:

[]

Used to indicate a set of characters. In a set:

Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.

...
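
To make the quoted rule concrete, here is a minimal sketch (the test strings are made up for illustration) contrasting the character class [.*] with the unbracketed .* that the question seems to have intended:

```python
import re

# Inside brackets, '.' and '*' lose their special meaning:
# [.*] matches exactly one character, a literal dot or a literal star.
char_class = re.compile(r"[.*]")

print(char_class.findall("a.b*c"))         # only the literal '.' and '*'
print(char_class.search("idiot") is None)  # no dot or star, so no match

# Outside brackets, '.*' means "any run of characters (except a newline)".
wildcard = re.compile(r"an .* of a ")
print(bool(wildcard.search("an idiot of a doctor")))
```

So a pattern such as "a [.*]of a " can only match text like "a .of a " or "a *of a ", which is why the original search finds nothing.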

So, here is one way to do it:

(th(at|e)|a[n]?)\b.*\b(a[n]?)\b.*

This expression matches the, that, a, or an, then any characters up to a standalone a or an, followed by the rest of the line.
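
One thing worth checking before relying on this expression: nothing in it requires the word "of". The sketch below uses the pattern unchanged, with two made-up test sentences (they are not from the question), to show that a sentence without "of" is matched as well:

```python
import re

# The pattern from this answer, unchanged.
regex = re.compile(r"(th(at|e)|a[n]?)\b.*\b(a[n]?)\b.*", re.IGNORECASE)

# Hypothetical test sentences, chosen only for illustration.
with_of = "She is an idiot of a doctor"
without_of = "the cat sat on a mat"

print(bool(regex.search(with_of)))
# This one matches too, because the pattern only asks for
# a determiner, some text, and a later standalone "a" or "an".
print(bool(regex.search(without_of)))
```

If false positives like the second sentence matter, the literal " of " probably needs to be part of the pattern.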

This link has a demonstration of how it works.

And here is the actual demonstration:

>>> import re
>>>
>>> regex = r"(th(at|e)|a[n]?)\b.*\b(a[n]?)\b.*"
>>> test_str = ("an idiot of a doctor\n"
    "the hard-worker of a student.\n"
    "an BLANK of an BLANK\n"
    "a BLANK of an BLANK\n"
    "an BLANK of a BLANK\n"
    "that BLANK of a BLANK\n"
    "the BLANK of a BLANK\n"
    "the BLANK of an BLANK\n")
>>>
>>> matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)
>>> 
>>> for m in matches:
        print(m.group())


an idiot of a doctor
the hard-worker of a student.
an BLANK of an BLANK
a BLANK of an BLANK
an BLANK of a BLANK
that BLANK of a BLANK
the BLANK of a BLANK
the BLANK of an BLANK

1

As it stands, this code will concatenate your pattern parameters into one long string with no operators between them. So in effect you are searching for the regex "a [.*]of a an [.*]of an a [.*]of an ..."
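
A quick way to see the concatenation for yourself, using just the first two pieces of the pattern from the question:

```python
# Adjacent string literals are joined into one string,
# with nothing inserted between them.
pattern = (r"a [.*]of a "
           r"an [.*]of an ")

print(repr(pattern))
```

The printed value is a single string, so re.search receives one long pattern rather than a set of alternatives.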

You are missing the 'or' operator: |. A simpler regex to accomplish this task would be something like:

(a|an|that|the) \b.*\b of (a|an) \b.*\b
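
Putting the pieces together, here is one possible sketch rather than a definitive solution. The names PATTERN and find_matching_sentences are made up for illustration, and using \S+ for each BLANK is an assumption that a BLANK is a single whitespace-free token such as "hard-worker". Note that it collects the sentences themselves, not the match objects:

```python
import re

# Illustrative pattern (an assumption, not the exact regex above):
#   \b                 keeps "a" from matching the tail of another word
#   (?:an?|that|the)   one of: a, an, that, the
#   \S+                one whitespace-free token standing in for BLANK
PATTERN = re.compile(r"\b(?:an?|that|the) \S+ of an? \S+", re.IGNORECASE)

def find_matching_sentences(sentences):
    """Return the sentences (not the match objects) that contain the pattern."""
    return [s for s in sentences if PATTERN.search(s)]

# Made-up sample data for the demonstration.
sentences = [
    "He is an idiot of a doctor.",
    "She is the hard-worker of a student.",
    "The cat sat on a mat.",
]

bnp = find_matching_sentences(sentences)
print(bnp)
```

If a BLANK may span several words, \S+ could be swapped for a lazy .+?, at the cost of more permissive matches.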
