1
str="<p class=\"drug-subtitle\"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>"

br=re.match("<p> class=\"drug-subtitle\"[^>]*>(.*?)</p>",str)

br returns None

what is the error in the regular expression i have used?

2
  • 2
    Do not use regex. Go through the DOM Commented Mar 22, 2014 at 10:05
  • i have no idea about DOM.can you mention the error in the code? Commented Mar 22, 2014 at 10:08

3 Answers 3

2

The fixed regex will be this one. Check the second line at where I have pointed and you'll find where it didn't work for you. I used findall() for easy access to all the matched group on my screen.

print re.findall('<p class="drug-subtitle"[^>]*>(.*?)</p>',input)
                    ^ you had a > character here

But, BeautifulSoup will be easy option for this kind of actions:

input='''
<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>
'''
soup = BeautifulSoup(input)
br = soup.find("p", {"class": "drug-subtitle"})
print str(br)
Sign up to request clarification or add additional context in comments.

Comments

1

I really highly recommend you use a DOM Parser library such as lxml along with for example cssselect to do this.

Example:

>>> from lxml.html import fromstring
>>> html = """<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>"""
>>> doc = fromstring(html)
>>> "".join(filter(None, (e.text for e in doc.cssselect(".drug-subtitle")[0])))
'Generic Name:Brand Names:Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA'

Comments

0

if you got the input:

'<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>' 

and you want to check if :

<p class="drug-subtitle"> .. some items here .. </p> 

exists in your input, the regex to be used is:

\<p\sclass=\"drug-subtitle\"[^>]*>(.*?)\<\/p\> 

description:

\< matches the character < literally
p matches the character p literally (case sensitive)
\s match any white space character [\r\n\t\f ]
class= matches the characters class= literally (case sensitive)
\" matches the character " literally
drug-subtitle matches the characters drug-subtitle literally (case sensitive)
\" matches the character " literally
[^>]* match a single character not present in the list below
    Quantifier: Between zero and unlimited times, as many times as possible,
               giving back as needed.
    > a single character in the list &gt; literally (case sensitive)
> matches the character > literally
1st Capturing group (.*?)
    .*? matches any character (except newline)
        Quantifier: Between zero and unlimited times, as few times as possible,
                    expanding as needed.
\< matches the character < literally
\/ matches the character / literally
p matches the character p literally (case sensitive)
\> matches the character > literally

so the problems in your regex are:

  1. in < p> there should be no ">".
  2. in < /p> you should escape the "<, / , >" characters by adding "\" before them.

hope this helped.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.