3

I'm parsing an HTML document using HTMLParser, and I want to print the contents between the start and end of a p tag.

See my code snippet:

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            print "TODO: print the contents"

3 Answers

8

Based on what @tauran posted, you probably want to do something like this:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def print_p_contents(self, html):
        self.tag_stack = []
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        self.tag_stack.append(tag.lower())

    def handle_endtag(self, tag):
        self.tag_stack.pop()

    def handle_data(self, data):
        if self.tag_stack[-1] == 'p':
            print data

p = MyHTMLParser()
p.print_p_contents('<p>test</p>')

From here, you might want to collect all <p> contents in a list and return that as the result, or something similar.
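As a minimal sketch of that idea, the variant below collects the text instead of printing it. It assumes Python 3 (where the module is html.parser rather than HTMLParser); the names PContentCollector, p_contents and get_p_contents are made up for illustration and are not part of the library.

```python
from html.parser import HTMLParser  # Python 3; in Python 2 the module is HTMLParser


class PContentCollector(HTMLParser):
    """Hypothetical helper: gathers text found directly inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.tag_stack = []   # currently open tags, innermost last
        self.p_contents = []  # collected text chunks

    def handle_starttag(self, tag, attrs):
        self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        if self.tag_stack:
            self.tag_stack.pop()

    def handle_data(self, data):
        # Only keep text whose innermost open tag is <p>.
        if self.tag_stack and self.tag_stack[-1] == 'p':
            self.p_contents.append(data)


def get_p_contents(html):
    """Parse html and return the list of text chunks found inside <p> tags."""
    parser = PContentCollector()
    parser.feed(html)
    parser.close()
    return parser.p_contents


print(get_p_contents('<p>test</p><div>skip</div><p>more</p>'))  # ['test', 'more']
```

Returning a list keeps the parsing separate from whatever you do with the text afterwards (printing, joining, writing to a file).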

TIL: when working with libraries like this, you need to think in stacks!


5 Comments

On a large HTML file I get this error at the line if self.tag_stack[-1] == 'p': IndexError: list index out of range
@MatthieuRiegler, sounds like your tag_stack is empty?
@MatthieuRiegler, please check, and then edit the code here to this: if len(self.tag_stack) and self.tag_stack[-1] == 'p':
Remember that HTML is not XML. In particular, start and end tags need not match. For example, '<ul> <li>text </ul>' is valid without a closing '</li>'. Hence the stack pop may need to continue until a match is found. But even this will not be enough for badly formed sequences such as '<b><i>text</b></i>' (incorrect nesting).
According to the documentation of the handle_starttag method, "The tag argument is the name of the tag converted to lower case." There is no need to call tag.lower().
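
The points raised in these comments can be combined into one sketch: guard against an empty stack, pop until the matching start tag is found, and ignore end tags that have no open match. This assumes Python 3; the class name TolerantPParser and the attribute found are hypothetical, and this is only one way to handle such input, not a complete HTML tree builder.

```python
from html.parser import HTMLParser  # Python 3 module name


class TolerantPParser(HTMLParser):
    """Hypothetical variant that tolerates omitted or stray end tags."""

    def __init__(self):
        super().__init__()
        self.tag_stack = []  # currently open tags, innermost last
        self.found = []      # text seen directly inside <p>

    def handle_starttag(self, tag, attrs):
        # tag already arrives in lower case, so tag.lower() is not needed.
        self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.tag_stack:
            return  # stray end tag with no open match: ignore it
        # Pop until the matching start tag has been removed. This also
        # discards tags that were never closed, such as <li> in <ul><li>text</ul>.
        while self.tag_stack.pop() != tag:
            pass

    def handle_data(self, data):
        # Guard against an empty stack (text outside any tag).
        if self.tag_stack and self.tag_stack[-1] == 'p':
            self.found.append(data)


parser = TolerantPParser()
parser.feed('outside<ul><li>item</ul><p>kept</p></b>')
parser.close()
print(parser.found)  # ['kept']
```

With incorrectly nested input such as '<b><i>text</b></i>', the '</b>' pops both entries and the later '</i>' is ignored. Tags that never take an end tag, such as <br>, would still stay on the stack until an enclosing tag closes, so they need separate handling.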
5

I extended the example from the docs:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

    def handle_data(self, data):
        print "Encountered data %s" % data

p = MyHTMLParser()
p.feed('<p>test</p>')

Output:

Encountered the beginning of a p tag
Encountered data test
Encountered the end of a p tag


1

That approach did not seem to work in my code, so I defined tag_stack = [] outside the class, as a kind of global variable.

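from html.parser import HTMLParser

tag_stack = []  # module-level stack of currently open tags

class MONanalyseur(HTMLParser):

    def handle_starttag(self, tag, attrs):
        tag_stack.append(tag.lower())

    def handle_endtag(self, tag):
        tag_stack.pop()

    def handle_data(self, data):
        # print text whose innermost open tag is <head>
        if tag_stack[-1] == 'head':
            print(data)

parser = MONanalyseur()
parser.feed('<head>test</head>')  # example input; pass your own HTML string here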
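
One possible explanation, which is an assumption because the failing code is not shown: the first answer creates self.tag_stack only inside print_p_contents(), so calling feed() directly raises an AttributeError. A sketch that avoids the global variable is to create the stack in __init__, so every parser instance has its own. The class name HeadTextParser is hypothetical, and the emptiness checks are an addition.

```python
from html.parser import HTMLParser


class HeadTextParser(HTMLParser):  # hypothetical name
    def __init__(self):
        super().__init__()   # let HTMLParser set up its own state first
        self.tag_stack = []  # per-instance stack, ready before feed() runs

    def handle_starttag(self, tag, attrs):
        self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        if self.tag_stack:
            self.tag_stack.pop()

    def handle_data(self, data):
        # Print text whose innermost open tag is <head>.
        if self.tag_stack and self.tag_stack[-1] == 'head':
            print(data)


parser = HeadTextParser()
parser.feed('<head>test</head>')  # example input
```

Because the stack lives on the instance, two parsers can run without sharing state, which a module-level list does not allow.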
