2

How do I get the value of a nested <b> HTML tag in Python using regular expressions?

<a href="/model.xml?hid=90971&amp;modelid=4636873&amp;show-uid=678650012772883921" class="b-offers__name"><b>LG</b> X110</a>

# => LG X110

5 Answers

7

You don't.

Regular expressions are not well suited to the nested structure of HTML. Use an HTML parser instead.
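
As a rough sketch of what "use an HTML parser" could look like with nothing but the standard library, the snippet below subclasses html.parser.HTMLParser and collects the text nodes it reports. The names TextCollector and extract_text are made up for this illustration and are not part of any library.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers every text node the parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags, however deeply the tags are nested.
        self.chunks.append(data)


def extract_text(html):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


html = ('<a href="/model.xml?hid=90971&amp;modelid=4636873'
        '&amp;show-uid=678650012772883921" class="b-offers__name">'
        '<b>LG</b> X110</a>')
print(extract_text(html))
```

The parser tracks where tags begin and end, so the nested <b> needs no special handling.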



6

Don't use regular expressions for parsing HTML. Use an HTML parser like BeautifulSoup. Just look how easy it is:

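from bs4 import BeautifulSoup  # BeautifulSoup 4; the old BeautifulSoup 3 module does not run on Python 3
html = r'<a href="removed because it was too long"><b>LG</b> X110</a>'
soup = BeautifulSoup(html, 'html.parser')
# Join every text node in the document, ignoring the tags around them.
print(''.join(soup.find_all(string=True)))
# LG X110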


1

Your question was very hard to understand, but from the given output example it looks like you want to strip every tag (everything between < and >) from the input text. That can be done like so:

import re
input_text = '<a bob>i <b>c</b></a>'
# Remove anything that looks like a tag: '<', any run of non-'>' characters, then '>'.
output_text = re.sub(r'<[^>]*>', '', input_text)
print(output_text)

Which gives you:

i c

If that is not what you want, please clarify.

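Please note that the regular expression approach to parsing XML is very brittle. For instance, the above example breaks on the input <a name="b>c">hey</a>: the pattern treats the > inside the attribute value as the end of the tag and returns c">hey instead of hey. (> is a valid character in an attribute value; see the XML specification.)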
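
To see that failure concretely, here is a small sketch that runs the same substitution on the tricky input and compares it with the standard library's xml.etree.ElementTree. The variable names are just for illustration.

```python
import re
import xml.etree.ElementTree as ET

tricky = '<a name="b>c">hey</a>'

# The regex ends the "tag" at the first '>' it meets, which here sits
# inside the quoted attribute value.
regex_result = re.sub(r'<[^>]*>', '', tricky)

# An XML parser knows the '>' belongs to the attribute value.
parsed_result = ''.join(ET.fromstring(tricky).itertext())

print(repr(regex_result))   # 'c">hey'
print(repr(parsed_result))  # 'hey'
```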


1

Try this...

<a.*<b>(.*)</b>(.*)</a>

$1 and $2 should be what you want; in Python, captured groups are read from the match object with match.group(1) and match.group(2).
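
For what it's worth, a sketch of how that pattern might be used from Python, applied to the snippet from the question. The variable names are illustrative. Note that the greedy .* happens to work here because the input holds a single link; with several links on one line it would likely swallow more than intended.

```python
import re

html = ('<a href="/model.xml?hid=90971&amp;modelid=4636873'
        '&amp;show-uid=678650012772883921" class="b-offers__name">'
        '<b>LG</b> X110</a>')

pattern = re.compile(r'<a.*<b>(.*)</b>(.*)</a>')
match = pattern.search(html)

# group(1) is the text inside <b>...</b>; group(2) is the text after it.
name = match.group(1) + match.group(2) if match else None
print(name)
```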


0

+1 for Jens's answer. lxml is a good library you can use to actually parse this in a robust fashion. If you'd prefer something in the standard library, you can use xml.sax, xml.dom or xml.etree.ElementTree.
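
A minimal standard-library sketch using xml.etree.ElementTree on the snippet from the question. It assumes the fragment is well-formed XML, which this one happens to be. Real-world HTML often is not, and a lenient parser such as lxml or BeautifulSoup would be the safer choice there.

```python
import xml.etree.ElementTree as ET

snippet = ('<a href="/model.xml?hid=90971&amp;modelid=4636873'
           '&amp;show-uid=678650012772883921" class="b-offers__name">'
           '<b>LG</b> X110</a>')

link = ET.fromstring(snippet)

# itertext() walks the element and all of its descendants in document
# order, yielding each piece of text it finds.
text = ''.join(link.itertext())
print(text)
```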
