Abort HTMLParser processing in Python

Question

When using the HTMLParser class in Python, is it possible to abort processing within a handle_* function? Early in the processing, I get all the data I need, so it seems like a waste to continue processing. There's an example below of extracting the meta description for a document.

# Python 2 module name; in Python 3 this is html.parser
from HTMLParser import HTMLParser

class MyParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        in_meta = False
        if tag == 'meta':
            for attr in attrs:
                if attr[0].lower() == 'name' and attr[1].lower() == 'description':
                    in_meta = True
                if attr[0].lower() == 'content':
                    print(attr[1])
                    # Would like to tell the parser to stop now,
                    # since I have all the data that I need

Oh, and I really recommend BeautifulSoup for parsing HTML; it's much easier to use.
– Eli Bendersky, Commented Jan 2, 2010 at 7:40
I think it is easy to conclude that this is merely a typo, because if it weren't, this code wouldn't work.
– shylent, Commented Jan 2, 2010 at 7:48
@shylent: I did not imply this is the cause of his problem, which is why it's a comment and not an answer.
– Eli Bendersky, Commented Jan 2, 2010 at 7:49

shylent · Accepted Answer · 2010-01-02 07:46:49Z

10

You can raise an exception and wrap your .feed() call in a try block.

You can also call self.reset() when you decide that you are done. I have not actually tried it, but according to the documentation it will "Reset the instance. Loses all unprocessed data.", which is precisely what you need.
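
A minimal sketch of the exception approach, assuming Python 3's html.parser module. The names StopParsing, MetaDescriptionParser, find_description, tags_seen and SAMPLE are made up for this illustration and are not part of the library:

```python
from html.parser import HTMLParser


class StopParsing(Exception):
    """Hypothetical signal used to abandon parsing early."""


class MetaDescriptionParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.description = None
        self.tags_seen = 0  # counts start tags, to show where parsing stopped

    def handle_starttag(self, tag, attrs):
        self.tags_seen += 1
        if tag != "meta":
            return
        attr_map = dict(attrs)  # attribute names arrive lowercased
        # a valueless attribute has the value None, hence the "or"
        if (attr_map.get("name") or "").lower() == "description":
            self.description = attr_map.get("content")
            raise StopParsing  # propagates out of feed()


def find_description(html_text):
    parser = MetaDescriptionParser()
    try:
        parser.feed(html_text)
        parser.close()
    except StopParsing:
        pass  # expected: the data we wanted has been collected
    return parser.description, parser.tags_seen


SAMPLE = (
    '<html><head>'
    '<meta name="description" content="A short summary">'
    '</head><body><p>never parsed</p></body></html>'
)

print(find_description(SAMPLE))  # -> ('A short summary', 3)
```

Only three start tags are counted for SAMPLE because the exception leaves feed() before the body is reached.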


6 Comments

Eli Bendersky Over a year ago

An exception doesn't sound like a nice idea here: exceptions should be used only for exceptional conditions, and in this case you propose using one as a control-flow tool. As for the 'reset' method, I've considered it too, but I can't figure out if it's really relevant here.

shylent Over a year ago

re: "exceptions .. for exceptional conditions": not so true for Python. Did you know that StopIteration is raised whenever an iterator "runs out of" iterations? That's not much of an "exceptional condition", now is it? In fact, it is distinctly similar to the condition the questioner wants to handle: a "break now" kind of condition.
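
To illustrate the StopIteration point, here is a rough sketch of what a for loop does with an iterator. The function name manual_loop is hypothetical and exists only for this illustration:

```python
def manual_loop(iterable):
    """Collect items the way a for loop would, but with explicit next() calls."""
    iterator = iter(iterable)
    collected = []
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            break  # normal end of iteration, not an error
        collected.append(item)
    return collected


print(manual_loop("abc"))  # -> ['a', 'b', 'c']
```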

Eli Bendersky Over a year ago

@shylent: true about StopIteration, but it is rarely handled manually; it is usually wrapped so that the user almost never sees it directly. Nevertheless, you're making a good point.

Yekhezkel Yovel Over a year ago

I was looking for an answer to the same question and liked your solution. I tried self.reset() first, as it seemed to be the specialized solution, but it raised an exception with a funny message like "we shouldn't be here". I am now raising a custom exception with the parsed data as an argument. An additional answer with an example is on the way.

panda-34 Over a year ago

Calling self.reset() from inside handle_starttag throws an exception with the text "we should not get here!" (Python 3.5). I think you had better heed that.


PaulMcG · Accepted Answer · 2010-01-02 23:29:42Z

1

If you use pyparsing's scanString method, you have more control over how far you actually go through the input string. In your example, we create an expression that matches a <meta> tag, and add a parse action that ensures that we only match the tag with name="description". This code assumes that you have read the page's HTML into the variable htmlsrc:

from pyparsing import makeHTMLTags, withAttribute

# makeHTMLTags creates both open and closing tags, only care about the open tag
metaTag = makeHTMLTags("meta")[0]
metaTag.setParseAction(withAttribute(name="description"))

try:
    # scanString is a generator that returns each match as it is found
    # in the input; next() pulls only the first match and stops there
    tokens, startloc, endloc = next(metaTag.scanString(htmlsrc))

    # attributes can be accessed like object attributes if they are
    # valid Python names
    print(tokens.content)

    # if the attribute name clashes with a Python keyword, or is
    # otherwise unsuitable as an identifier, use dict-like access instead
    print(tokens["content"])

except StopIteration:
    print("no matching meta tag found")


1 Comment

Michael Mior Over a year ago

Thanks for the answer. I'm sure this works as well and I appreciate having somewhat of an introduction to pyparsing. I would mark both correct if I could.

Yekhezkel Yovel · Accepted Answer · 2016-05-02 06:57:48Z

1

Expanding on @shylent's answer, here's my solution:

from html.parser import HTMLParser  # Python 3 module; in Python 2 it is HTMLParser


class MyParser(HTMLParser):

    boolean_flag = False

    def handle_starttag(self, tag, attrs):
        # for example: flag the tag whose text we want
        self.boolean_flag = (tag == "sometag" and ("id", "someid") in attrs)

    def handle_endtag(self, tag):
        pass

    def handle_data(self, data):
        if self.boolean_flag:
            # abort parsing and hand the data to the caller
            raise DataParsedException(data)


class DataParsedException(Exception):
    def __init__(self, data):
        self.data = data

Usage:

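parser = MyParser()
results = []  # renamed from vars, which shadows a Python builtin
try:
    # html is assumed to hold the page source as bytes
    parser.feed(html.decode())
except DataParsedException as dataParsed:
    # the exception carries the text that was found
    results.append(dataParsed.data)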

It does the job.
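
For anyone who shares the reservation about exceptions raised in the comments on the first answer, a possible alternative is to feed the document in chunks and check a flag between chunks. This is only a sketch under assumptions: FlagParser, done and first_match are hypothetical names, parsing stops only at a chunk boundary (the rest of the current chunk is still parsed), and a text node that straddles two chunks may be delivered in pieces.

```python
from html.parser import HTMLParser


class FlagParser(HTMLParser):
    """Records the first text node inside <sometag id="someid">."""

    def __init__(self):
        super().__init__()
        self.in_target = False
        self.data = None

    @property
    def done(self):
        return self.data is not None

    def handle_starttag(self, tag, attrs):
        if not self.done:
            self.in_target = tag == "sometag" and ("id", "someid") in attrs

    def handle_data(self, data):
        if self.in_target and not self.done:
            self.data = data


def first_match(chunks):
    """Feed chunks one at a time and stop as soon as the data is found."""
    parser = FlagParser()
    consumed = 0
    for chunk in chunks:
        parser.feed(chunk)
        consumed += 1
        if parser.done:
            break  # remaining chunks are never handed to the parser
    return parser.data, consumed


CHUNKS = [
    "<html><body>",
    '<sometag id="someid">hello</sometag>',
    "<p>lots more</p>",
    "<p>even more</p>",
]

print(first_match(CHUNKS))  # -> ('hello', 2)
```

The trade-off is granularity: the exception stops at the exact tag, while the flag stops at the next chunk boundary, which may be good enough when the input already arrives in pieces.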
