Python HTMLParser

Question

I'm parsing an HTML document using HTMLParser and I want to print the contents between the start and end of a p tag.

Here is my code snippet:

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            print "TODO: print the contents"

Daren Thomas · Accepted Answer · 2011-08-26 12:38:35Z

8

Based on what @tauran posted, you probably want to do something like this:

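from HTMLParser import HTMLParser  # Python 2 module name

class MyHTMLParser(HTMLParser):
    def print_p_contents(self, html):
        self.tag_stack = []  # currently open tags, innermost last
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        self.tag_stack.append(tag.lower())

    def handle_endtag(self, tag):
        if self.tag_stack:  # ignore an end tag when nothing is open
            self.tag_stack.pop()

    def handle_data(self, data):
        # the stack is empty for text outside any tag, so check it first
        if self.tag_stack and self.tag_stack[-1] == 'p':
            print data

p = MyHTMLParser()
p.print_p_contents('<p>test</p>')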

Now, you might want to push all the contents into a list and return that as the result, or something similar.

TIL: when working with libraries like this, you need to think in stacks!
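The suggestion above to push the contents into a list and return it can be sketched as follows. This is a minimal sketch, not the answer author's code: it assumes Python 3 (where the module is `html.parser`), and the names `ParagraphCollector` and `get_p_contents` are hypothetical helpers introduced only for illustration.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects text found directly inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.tag_stack = []   # currently open tags, innermost last
        self.paragraphs = []  # collected text chunks

    def handle_starttag(self, tag, attrs):
        self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        if self.tag_stack:  # ignore an end tag when nothing is open
            self.tag_stack.pop()

    def handle_data(self, data):
        # Only keep text whose innermost open tag is <p>.
        if self.tag_stack and self.tag_stack[-1] == 'p':
            self.paragraphs.append(data)


def get_p_contents(html):
    """Hypothetical wrapper: parse html and return the collected list."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


result = get_p_contents('<p>test</p><div>skip</div><p>more</p>')
print(result)  # ['test', 'more']
```

Returning a list instead of printing leaves it to the caller to decide what to do with the text.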

edited Aug 26, 2011 at 12:38

answered Aug 26, 2011 at 11:51

Daren Thomas


5 Comments

Matthieu Riegler Over a year ago

On a large HTML file, the line if self.tag_stack[-1] == 'p': raises IndexError: list index out of range.

Daren Thomas Over a year ago

@MatthieuRiegler, sounds like your tag_stack is empty?

Daren Thomas Over a year ago

@MatthieuRiegler, please check, and then edit the code here to this: if len(self.tag_stack) and self.tag_stack[-1] == 'p':

Rhubbarb Over a year ago

Remember that HTML is not XML. In particular, start and end tags need not match. For example '<ul> <li>text </ul>' (but no '</li>' necessarily). Hence the stack pop perhaps needs to continue until a match is found. But even this will not be enough for badly-formed sequences such as 'text' (incorrect nesting).

frogcoder Over a year ago

According to the documentation of the handle_starttag method, "the tag argument is the name of the tag converted to lower case". There is no need to call tag.lower().
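The comments above raise three points: text can arrive while the stack is empty, end tags need not match the most recent start tag, and the tag name is already lower case. A rough sketch of a parser that accounts for all three, assuming Python 3, might look like this. The class name `TolerantCollector` is hypothetical, and popping until a match is found is only one possible recovery strategy for unclosed tags.

```python
from html.parser import HTMLParser


class TolerantCollector(HTMLParser):
    """Hypothetical variant that tolerates unclosed and stray tags."""

    def __init__(self):
        super().__init__()
        self.tag_stack = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        # tag is already lower case, so no .lower() call is needed
        self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.tag_stack:
            return  # stray end tag with no matching open tag: ignore it
        # Pop unclosed inner tags until the matching tag has been removed.
        while self.tag_stack.pop() != tag:
            pass

    def handle_data(self, data):
        # Guard against an empty stack before looking at the top element.
        if self.tag_stack and self.tag_stack[-1] == 'p':
            self.paragraphs.append(data)


collector = TolerantCollector()
# Stray </div>, an unclosed <span>, and text outside any tag.
collector.feed('</div><p>one<span>two</p>three<p>four</p>')
collector.close()
tolerant_result = collector.paragraphs
print(tolerant_result)  # ['one', 'four']
```

With a plain pop in handle_endtag, the `</p>` would remove only the unclosed `span`, leaving `p` on the stack, and `three` would be collected by mistake. This sketch still does not handle every case: a void element written as `<br>` never produces an end tag, so it stays on the stack until an enclosing end tag removes it.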

tauran · Accepted Answer · 2011-08-26 11:45:00Z

5

I extended the example from the docs:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

    def handle_data(self, data):
        print "Encountered data %s" % data

p = MyHTMLParser()
p.feed('<p>test</p>')

Output:

Encountered the beginning of a p tag
Encountered data test
Encountered the end of a p tag
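For readers on Python 3, the same trace can be reproduced with the `html.parser` module. The sketch below is an adaptation, not the answer's original code: it records the events in a list (the `EventRecorder` name and the tuple format are assumptions made for illustration) and prints them afterwards.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records parser callbacks in the order received."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


recorder = EventRecorder()
recorder.feed('<p>test</p>')
recorder.close()
for kind, value in recorder.events:
    print(kind, value)
```

The order of the recorded events (start tag, data, end tag) is what makes a stack a natural fit for tracking which tag the current text belongs to.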

answered Aug 26, 2011 at 11:45

tauran


Glorfindel · Accepted Answer · 2015-07-08 17:28:59Z

1

It did not seem to work for my code, so I defined tag_stack = [] outside the class, as a sort of global variable.

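from html.parser import HTMLParser

tag_stack = []  # module-level stack, shared by every parser instance

class MONanalyseur(HTMLParser):

    def handle_starttag(self, tag, attrs):
        tag_stack.append(tag.lower())
    def handle_endtag(self, tag):
        if tag_stack:  # ignore an end tag when nothing is open
            tag_stack.pop()
    def handle_data(self, data):
        # check that the stack is not empty before reading the top
        if tag_stack and tag_stack[-1] == 'head':
            print(data)

parser = MONanalyseur()
parser.feed('<head>test</head>')  # example input; feed() requires the HTML string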
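A module-level list is shared by every parser instance. One alternative is to create the stack in `__init__`, so each parser has its own. The answer does not say why the earlier version failed; one possible reason (an assumption) is that `feed()` was called directly, so `self.tag_stack` was never created. The sketch below uses a hypothetical `HeadTextParser` class and, unlike the code above, collects text anywhere inside `head` rather than only directly under it.

```python
from html.parser import HTMLParser


class HeadTextParser(HTMLParser):
    """Hypothetical parser that keeps its stack on the instance."""

    def __init__(self):
        super().__init__()    # let the base class set up its own state
        self.tag_stack = []   # per-instance, so parsers do not interfere
        self.head_text = []

    def handle_starttag(self, tag, attrs):
        self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        if self.tag_stack:
            self.tag_stack.pop()

    def handle_data(self, data):
        # Collect text at any depth inside <head>, e.g. inside <title>.
        if 'head' in self.tag_stack:
            self.head_text.append(data)


first = HeadTextParser()
first.feed('<html><head><title>First</title></head><body>x</body></html>')
first.close()

second = HeadTextParser()
second.feed('<html><head><title>Second</title></head></html>')
second.close()

print(first.head_text, second.head_text)
```

Because the state is created in `__init__`, calling `feed()` directly works, and the two parsers above do not see each other's tags.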

edited Jul 8, 2015 at 17:28

Glorfindel


answered Jul 8, 2015 at 17:08

nate

