Matching an HTML tag using regex in Python

Question

str="<p class=\"drug-subtitle\"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>"

br=re.match("<p> class=\"drug-subtitle\"[^>]*>(.*?)</p>",str)

br is None.

What is the error in the regular expression I have used?

I have no idea about DOM. Can you point out the error in the code?
– FathimaBeevi, commented Mar 22, 2014 at 10:08

Sabuj Hassan · Accepted Answer · 2014-03-22 10:12:05Z

2

Here is the fixed regex. The marker on the second line points at the spot where your pattern had an extra character, which is why it did not match. I used findall() so that every captured group is printed at once.

print(re.findall('<p class="drug-subtitle"[^>]*>(.*?)</p>', input))
                    ^ you had a > character here
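
To see the difference side by side, here is a minimal sketch in Python 3 that runs both patterns against the string from the question. The names html, broken and fixed are mine, chosen for illustration; they do not appear in the question.

```python
import re

html = ('<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation '
        '(al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, '
        'Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>')

# Pattern from the question: the stray ">" makes it look for '<p> class=...'
broken = re.compile(r'<p> class="drug-subtitle"[^>]*>(.*?)</p>')
# Corrected pattern: the tag name is followed directly by its attributes
fixed = re.compile(r'<p class="drug-subtitle"[^>]*>(.*?)</p>')

broken_match = broken.match(html)
fixed_match = fixed.match(html)

# group(1) is the markup between the opening and closing <p> tags
inner_html = fixed_match.group(1) if fixed_match else None

print(broken_match)
print(inner_html)
```

Note that re.match() only matches at the start of the string. That works here because the string begins with the tag; for a tag somewhere inside a larger page, re.search() or re.findall() is the better fit.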

That said, BeautifulSoup is an easier option for this kind of task:

from bs4 import BeautifulSoup  # assumes the bs4 package (BeautifulSoup 4)

input = '''
<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>
'''
soup = BeautifulSoup(input, "html.parser")  # name the parser explicitly
br = soup.find("p", {"class": "drug-subtitle"})  # first <p> with that class
print(str(br))

James Mills · Answer · 2014-03-22 11:27:37Z

1

I strongly recommend using a DOM parser library such as lxml, together with cssselect for example, to do this.

Example:

>>> from lxml.html import fromstring
>>> html = """<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>"""
>>> doc = fromstring(html)
>>> "".join(filter(None, (e.text for e in doc.cssselect(".drug-subtitle")[0])))
'Generic Name:Brand Names:Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA'
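
If installing lxml is not an option, the same parse-then-extract idea can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the lxml approach shown above, and the class name SubtitleText is hypothetical. Unlike the one-liner above, it also keeps the text that follows each child element (for example the part after "Generic Name:").

```python
from html.parser import HTMLParser

html = ('<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation '
        '(al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, '
        'Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>')


class SubtitleText(HTMLParser):
    """Collect all text found inside <p class="drug-subtitle"> elements."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while we are inside a target <p>
        self.chunks = []      # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "drug-subtitle" in classes:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


parser = SubtitleText()
parser.feed(html)
parser.close()
text = "".join(parser.chunks)
print(text)
```

This sketch assumes the target paragraph does not contain nested <p> elements, which HTML does not allow anyway.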

Ammar · Answer · 2014-03-22 20:03:58Z

If you have the input:

'<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>'

and you want to check whether

<p class="drug-subtitle"> .. some items here .. </p>

exists in your input, the regex to use is:

\<p\sclass=\"drug-subtitle\"[^>]*>(.*?)\<\/p\>

description:

\< matches the character < literally
p matches the character p literally (case sensitive)
\s matches any whitespace character (in ASCII: [ \t\n\r\f\v])
class= matches the characters class= literally (case sensitive)
\" matches the character " literally
drug-subtitle matches the characters drug-subtitle literally (case sensitive)
\" matches the character " literally
[^>]* matches a single character not present in the list below
    Quantifier: Between zero and unlimited times, as many times as possible,
               giving back as needed.
    > a single character in the list > literally (case sensitive)
> matches the character > literally
1st Capturing group (.*?)
    .*? matches any character (except newline)
        Quantifier: Between zero and unlimited times, as few times as possible,
                    expanding as needed.
\< matches the character < literally
\/ matches the character / literally
p matches the character p literally (case sensitive)
\> matches the character > literally
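
The same breakdown can be kept next to the pattern itself by using re.VERBOSE, which lets a regex span several lines with comments. This is only a sketch: the shortened sample string is mine, and the optional backslashes before <, /, > and " are left out.

```python
import re

pattern = re.compile(r'''
    <p                      # literal "<p"
    \s                      # one whitespace character
    class="drug-subtitle"   # the attribute, matched literally
    [^>]*                   # zero or more characters other than ">"
    >                       # end of the opening tag
    (.*?)                   # group 1: as few characters as possible
    </p>                    # literal closing tag
''', re.VERBOSE)

sample = '<p class="drug-subtitle"><b>Generic Name:</b> albuterol</p>'
captured = pattern.match(sample).group(1)
print(captured)

# The pattern from the question has a ">" right after "<p", so it is rejected
rejected = pattern.match('<p> class="drug-subtitle">x</p>')
print(rejected)
```

In verbose mode, whitespace in the pattern is ignored, which is why the single space after "<p" is written as \s here.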

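So the real problem in your regex is the ">" in "<p>": the tag name must be followed directly by the attributes, with no ">" in between.

Escaping the "<", "/" and ">" characters in "</p>" by adding "\" before them is optional in Python's re module; they match literally either way.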
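
Whether the backslashes are needed is easy to check directly. The sketch below runs the pattern with and without them in Python's re module; the sample string is shortened by me for readability.

```python
import re

html = '<p class="drug-subtitle"><b>Generic Name:</b> albuterol</p>'

# Pattern as written in this answer, with every punctuation character escaped
escaped = r'\<p\sclass=\"drug-subtitle\"[^>]*>(.*?)\<\/p\>'
# Same pattern without the backslashes before <, /, > and "
plain = r'<p\sclass="drug-subtitle"[^>]*>(.*?)</p>'

escaped_groups = re.findall(escaped, html)
plain_groups = re.findall(plain, html)

print(escaped_groups)
print(escaped_groups == plain_groups)
```

Both calls return the same list, so in Python the escapes are harmless but not required. Other regex flavors may treat sequences such as \< differently, so the unescaped form is the safer habit when a pattern might be reused elsewhere.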

Hope this helps.
