I want to parse HTML in Python

Question

I have this small class:

class HTMLTagStripper(HTMLParser):
    def __init__(self):
       self.reset()
       self.fed = []
    def handle_data(self, data):
       self.fed.append(data)
    def handle_starttag(self, tag, attrs):
       if tag == 'a':
           return attrs[0][1]
    def get_data(self):
       return ''.join(self.fed)

parsing this HTML code:

<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>

This is the result I get: long text click here
but I want to get: long text click somelink.com

Is there a way to do this?

If there is the will... I know I will be shot at here for this suggestion, but if all you want to do is remove tags you can use a regex :-)
– Simon Bergot, Commented Jun 19, 2012 at 13:28
Please don't parse HTML with regex. Use BeautifulSoup or another library designed for it instead.
– Andy ♦, Commented Jun 19, 2012 at 13:33

Levon · Accepted Answer · 2012-06-20 01:58:07Z

8

Take a look at BeautifulSoup; it will do that and much more.

Or you could use regular expressions/string operations to strip out the data you want. In the long run using something like BeautifulSoup will pay off, especially if you expect to do more of this.

Here's one way to use BeautifulSoup to extract the single/only link in your HTML data (I'm not an expert with this, so there may be other, better ways - suggestions/corrections welcome).

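from bs4 import BeautifulSoup  # BeautifulSoup 4; the old "BeautifulSoup" module was version 3

s = """<div id="footer">
       <p>long text.</p>
       <p>click <a href="somelink.com">here</a>
       </div>"""

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(s, "html.parser")
# Find the first <a> tag that has an href attribute and read its value
your_link = soup.find('a', href=True)['href']
print('long text click', your_link)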

will print:

long text click somelink.com

edited Jun 20, 2012 at 1:58

answered Jun 19, 2012 at 13:27

Levon

1 Comment

Levon Over a year ago

@user1307624 If this solved your problem please consider accepting this answer by clicking on the checkmark next to my answer. It will mark this problem as solved and reward us both with some rep points. Thanks.

coder · Accepted Answer · 2012-07-19 03:45:05Z

0

I was actually checking out this new HTML parser library and came up with this solution:

from htmldom import htmldom
dom = htmldom.HtmlDom().createDom( """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>""")
nodes = dom.find( "p" ).children( all_children = True ) # this makes all text nodes to be in the set.
for node in nodes:
    if node._is( "a" ):
        print( node.attr( "href" ).strip() )
    elif node._is( "text" ):
        print( node.getNode().text, end = '', sep = ' ' )

You can download the library from SourceForge or from the Python Package Index (HtmlDom). It works on Python 3.x. The documentation is not great, but it is understandable. Hope you like the answer :)

answered Jul 19, 2012 at 3:45

coder

1 Comment

coder Over a year ago

You can find documentation at Documentation

bcelary · Accepted Answer · 2012-06-19 13:52:39Z

0

This WILL NOT work for you:

import re
x = re.compile(r'<.*?>')
stripped = x.sub('', html)

because you also want to extract attribute values (such as href) from the HTML tags, and this regex throws the tags away entirely.

As Levon points out: you should go for BeautifulSoup.
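
To see why the tag-stripping regex falls short, here is a small sketch that runs it on the HTML from the question. The whitespace-collapsing step and the variable names are my own additions for readability; they are not part of the original snippet.

```python
import re

html = """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>"""

# Remove every tag, attributes included
tag_pattern = re.compile(r"<.*?>")
stripped = tag_pattern.sub("", html)

# Collapse newlines and repeated spaces so the result is easy to read
stripped = " ".join(stripped.split())

print(stripped)                    # long text. click here
print("somelink.com" in stripped)  # False: the href went away with the <a> tag
```

The link text survives, but the href value is gone, which is the opposite of what the question asks for.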

edited Jun 19, 2012 at 13:52

answered Jun 19, 2012 at 13:28

bcelary

1 Comment

bcelary Over a year ago

Ah, right. Thanks for pointing this out. I hadn't noticed that in the question.

bruno desthuilliers · Accepted Answer · 2012-06-19 14:28:05Z

0

Replacing this:

def handle_starttag(self, tag, attrs):
   if tag == 'a':
       return attrs[0][1]

With this:

def handle_starttag(self, tag, attrs):
   if tag == 'a':
       value = dict(attrs).get("href", None)
       if value:
           # add extra spaces since you don't sanitize
           # them in get_data
           self.fed.append(" %s " % value)

should kind of work. Or not, depending on the HTML source code. That's why we have BeautifulSoup.
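
For reference, here is a minimal, self-contained sketch of the whole class with that change applied, written for Python 3 (where HTMLParser lives in html.parser). Two things in it are assumptions on my part rather than part of the answer above: the in_link flag, which suppresses the visible link text so "here" is replaced by the href instead of appearing next to it, and the whitespace collapsing in get_data.

```python
from html.parser import HTMLParser


class HTMLTagStripper(HTMLParser):
    """Collects text content, substituting each link's href for its link text."""

    def __init__(self):
        super().__init__()
        self.fed = []
        self.in_link = False  # assumption: True while inside an <a> with an href

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.fed.append(href)
                self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Skip the visible link text when its href has already been emitted
        if not self.in_link:
            self.fed.append(data)

    def get_data(self):
        # Collapse runs of whitespace (including newlines) into single spaces
        return " ".join("".join(self.fed).split())


html = """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>"""

stripper = HTMLTagStripper()
stripper.feed(html)
stripper.close()
result = stripper.get_data()
print(result)  # long text. click somelink.com
```

This still relies on the markup being reasonably regular. A link without an href keeps its text, and nested or badly broken markup may need extra handling.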

answered Jun 19, 2012 at 14:28

bruno desthuilliers
