Extracting some HTML tag values in Python

Question

How do I get the text of an element that contains a nested <b> HTML tag in Python using regular expressions?

<a href="/model.xml?hid=90971&amp;modelid=4636873&amp;show-uid=678650012772883921" class="b-offers__name"><b>LG</b> X110</a>

# => LG X110

Jens · Accepted Answer · 2010-06-23 12:17:17Z

7

You don't.

Regular Expressions are not well suited to deal with the nested structure of HTML. Use an HTML parser instead.
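As a rough illustration of the parser route, here is a minimal sketch using the standard library's html.parser module. The class name TextCollector and the helper extract_text are made up for this example, and the href in the sample markup is shortened from the one in the question.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def extract_text(html):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Shortened version of the markup from the question (assumption: href trimmed).
html = '<a href="/model.xml?hid=90971" class="b-offers__name"><b>LG</b> X110</a>'
print(extract_text(html))  # LG X110
```

The parser keeps track of where tags begin and end, so the nesting of <b> inside <a> needs no special handling.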

edited Jun 23, 2010 at 12:17

answered Jun 23, 2010 at 10:44

Jens


Dzinx · Accepted Answer · 2010-06-23 10:59:43Z

6

Don't use regular expressions for parsing HTML. Use an HTML parser like BeautifulSoup. Just look how easy it is:

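from bs4 import BeautifulSoup  # BeautifulSoup 4; the original answer used the older BeautifulSoup 3 import
html = r'<a href="removed because it was too long"><b>LG</b> X110</a>'
soup = BeautifulSoup(html, 'html.parser')
# Join every text node in the document, ignoring the tags around them
print(''.join(soup.find_all(string=True)))
# LG X110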

answered Jun 23, 2010 at 10:59

Dzinx


Deestan · Accepted Answer · 2010-06-23 11:15:27Z

1

Your question was very hard to understand, but from the given output example it looks like you want to strip everything within < and > from the input text. That can be done like so:

import re
input_text = '<a bob>i <b>c</b></a>'
# Remove everything from a '<' up to the next '>'
output_text = re.sub(r'<[^>]*>', '', input_text)
print(output_text)

Which gives you:

i c

If that is not what you want, please clarify.

Please note that the regular expression approach to parsing HTML or XML is very brittle. For instance, the above example would break on the input <a name="b>c">hey</a>, because > is a valid character in an attribute value (see the XML specification).
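To make the failure concrete, here is a small sketch that runs the same substitution on both inputs. The names TAG_PATTERN and strip_tags are only for illustration.

```python
import re

# Same pattern as above: a '<', then anything that is not '>', then '>'.
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(text):
    """Naively delete everything that looks like a tag."""
    return TAG_PATTERN.sub("", text)


print(strip_tags('<a bob>i <b>c</b></a>'))   # i c
# The '>' inside the attribute value ends the match too early,
# so part of the attribute leaks into the output.
print(strip_tags('<a name="b>c">hey</a>'))   # c">hey
```

The second result shows the pattern treating the > inside the quoted attribute as the end of the tag, which is exactly the kind of case a real parser handles for you.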

edited Jun 23, 2010 at 11:15

answered Jun 23, 2010 at 10:49

Deestan


Adrian Regan · Accepted Answer · 2010-06-23 11:27:18Z

1

Try this...

<a.*<b>(.*)</b>(.*)</a>

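The first and second captured groups ($1 and $2 in Perl-style notation) should be what you want. In Python, they are available as match.group(1) and match.group(2) on the match object returned by re.search.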
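In Python, that pattern could be applied roughly as follows. This is only a sketch: the function name extract_name is hypothetical, and the href in the sample is shortened from the one in the question.

```python
import re

# The pattern from this answer, compiled once.
LINK_PATTERN = re.compile(r"<a.*<b>(.*)</b>(.*)</a>")


def extract_name(html):
    """Return the text of the <b> element plus the text after it, or None."""
    match = LINK_PATTERN.search(html)
    if match is None:
        return None
    # group(1) is the text inside <b>...</b>, group(2) is the text after it.
    return match.group(1) + match.group(2)


html = '<a href="/model.xml" class="b-offers__name"><b>LG</b> X110</a>'
print(extract_name(html))  # LG X110
```

Because .* is greedy and does not cross newlines by default, this only behaves as intended when there is a single such link on one line; anything more varied is where the parser-based answers are the safer choice.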

edited Jun 23, 2010 at 11:27

answered Jun 23, 2010 at 10:48

Adrian Regan


Noufal Ibrahim · Accepted Answer · 2010-06-23 10:54:13Z

0

+1 for Jens's answer. lxml is a good library you can use to parse this robustly. If you'd prefer something in the standard library, you can use xml.sax, xml.dom or xml.etree.ElementTree.
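For the standard-library route, a minimal sketch with xml.etree.ElementTree might look like this. It assumes the fragment is well-formed XML, which happens to be true for the snippet in the question (shortened here) but often is not for HTML found in the wild.

```python
import xml.etree.ElementTree as ET

# Shortened version of the markup from the question; note the escaped '&amp;'.
fragment = (
    '<a href="/model.xml?hid=90971&amp;modelid=4636873" '
    'class="b-offers__name"><b>LG</b> X110</a>'
)

root = ET.fromstring(fragment)
# itertext() walks the element and its children, yielding each piece of text
# in document order: "LG" from <b>, then " X110" that follows it.
text = "".join(root.itertext())
print(text)  # LG X110
```

If the input is not well-formed, ET.fromstring raises a ParseError, in which case a more forgiving HTML parser such as lxml's HTML support or BeautifulSoup is the better fit.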

answered Jun 23, 2010 at 10:54

Noufal Ibrahim
