0

I have this small class:

class HTMLTagStripper(HTMLParser):
    def __init__(self):
       self.reset()
       self.fed = []
    def handle_data(self, data):
       self.fed.append(data)
    def handle_starttag(self, tag, attrs):
       if tag == 'a':
           return attrs[0][1]
    def get_data(self):
       return ''.join(self.fed)

parsing this HTML code:

<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>

This is the result I get: long text click here
but I want to get: long text click somelink.com

Is there a way to do this?

2
  • If there is the will... I know I will be shot at here for this suggestion, but if all you want to do is remove tags you can use a regex :-) Commented Jun 19, 2012 at 13:28
  • 4
    Please don't parse HTML with regex. Use BeautifulSoup or another library designed for it instead. Commented Jun 19, 2012 at 13:33

4 Answers

8

Take a look at BeautifulSoup; it will do that and much more.

Or you could use regular expressions/string operations to strip out the data you want. In the long run using something like BeautifulSoup will pay off, especially if you expect to do more of this.

Here's one way to use BeautifulSoup to extract the single/only link in your HTML data (I'm not an expert with this, so there may be other, better ways - suggestions/corrections welcome).

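# BeautifulSoup 4 import; with the older BeautifulSoup 3 it was `from BeautifulSoup import BeautifulSoup`
from bs4 import BeautifulSoup
s = """<div id="footer">
       <p>long text.</p>
       <p>click <a href="somelink.com">here</a>
       </div>"""

soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly
# find the first <a> tag that has an href attribute and read its value
your_link = soup.find('a', href=True)['href']
print('long text click', your_link)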

will print:

long text click somelink.com


1 Comment

@user1307624 If this solved your problem please consider accepting this answer by clicking on the checkmark next to my answer. It will mark this problem as solved and reward us both with some rep points. Thanks.
0

I was actually checking out this new HTML parser library and came up with this solution:

from htmldom import htmldom
dom = htmldom.HtmlDom().createDom( """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>""")
nodes = dom.find( "p" ).children( all_children = True ) # all_children = True also includes the text nodes in the set.
for node in nodes:
    if node._is( "a" ):
        print( node.attr( "href" ).strip() )
    elif node._is( "text" ):
        print( node.getNode().text, end = '', sep = ' ' )

You can download the library from SourceForge or from the Python Package Index (HtmlDom). It works on Python 3.x. The documentation of the library is not that good, but it is understandable. Hope you like the answer :)

1 Comment

You can find documentation at Documentation
0

This WILL NOT work for you:

import re

x = re.compile(r'<.*?>')  # non-greedy match for anything between < and >
stripped = x.sub('', html)

because you also want to extract some attribute values (such as href) from the HTML tags.

As Levon points out: you should go for BeautifulSoup.
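
To make the point concrete, here is a small sketch that runs the pattern above against the HTML from the question (the names `html` and `tag_pattern` are only for illustration). The tags are removed wholesale, so the href value disappears along with them:

```python
import re

# Sample HTML taken from the question
html = """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>"""

# Non-greedy pattern that matches any single tag, attributes included
tag_pattern = re.compile(r'<.*?>')
stripped = tag_pattern.sub('', html)

# Only the text between the tags survives; the href is gone
print(stripped.split())  # ['long', 'text.', 'click', 'here']
```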

1 Comment

Ah, right. Thanks for pointing this out. I hadn't noticed that in the question.
0

Replacing this:

def handle_starttag(self, tag, attrs):
   if tag == 'a':
       return attrs[0][1]

With this:

def handle_starttag(self, tag, attrs):
    if tag == 'a':
        # attrs is a list of (name, value) pairs, so look href up by name
        value = dict(attrs).get("href")
        if value:
            # add extra spaces since you don't sanitize
            # them in get_data
            self.fed.append(" %s " % value)

should kind of work. Or not, depending on the HTML source code. That's why we have BeautifulSoup.
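
For completeness, here is a self-contained sketch of the whole class with that change applied, written for Python 3 (`html.parser`). Two things are my own assumptions rather than part of the question: the `in_link` flag, which skips the link text so that the URL takes the place of "here", and the whitespace cleanup in `get_data`. The `strip_tags` function is a hypothetical convenience wrapper.

```python
from html.parser import HTMLParser


class HTMLTagStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fed = []
        self.in_link = False  # assumption: True while inside <a href=...>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; look href up by name
            href = dict(attrs).get('href')
            if href:
                self.fed.append(href)
                self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        # Skip the link text, since the href was already collected
        if not self.in_link:
            self.fed.append(data)

    def get_data(self):
        # Collapse runs of whitespace (including newlines) into single spaces
        return ' '.join(''.join(self.fed).split())


def strip_tags(markup):
    """Hypothetical helper: return the text of markup with links shown as URLs."""
    parser = HTMLTagStripper()
    parser.feed(markup)
    parser.close()
    return parser.get_data()


html = """<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>"""

result = strip_tags(html)
print(result)  # long text. click somelink.com
```

The period after "long text" stays because it is part of the text in the source. Links without an href keep their text, and messy real-world HTML may still need more care than this.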
