When using the HTMLParser class in Python, is it possible to abort processing within a handle_* function? Early in the processing, I get all the data I need, so it seems like a waste to continue processing. There's an example below of extracting the meta description for a document.

from HTMLParser import HTMLParser  # Python 2; in Python 3: from html.parser import HTMLParser

class MyParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            # attrs is a list of (name, value) pairs; a dict makes lookups easier
            attr_map = dict(attrs)
            if (attr_map.get('name') or '').lower() == 'description':
                print(attr_map.get('content'))
                # Would like to tell the parser to stop now,
                # since I have all the data that I need
4 Comments

  • Shouldn't you inherit from HTMLParser and not Parser? Commented Jan 2, 2010 at 7:39
  • Oh, and I really recommend BeautifulSoup for parsing HTML; it's much easier to use. Commented Jan 2, 2010 at 7:40
  • I think it is easy to conclude that this is merely a typo, because if it weren't, this code wouldn't work. Commented Jan 2, 2010 at 7:48
  • @shylent: I did not imply this is the cause of his problem; hence it's a comment, not an answer. Commented Jan 2, 2010 at 7:49

3 Answers

You can raise an exception and wrap your .feed() call in a try block.

You can also call self.reset() once you decide that you are done. I have not actually tried it, but according to the documentation it will "Reset the instance. Loses all unprocessed data.", which is precisely what you need.
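
As a rough illustration of the exception approach applied to the meta description case from the question, here is a minimal sketch. It assumes Python 3 (html.parser); the names StopParsing, MetaDescriptionParser and find_meta_description are made up for this example and are not part of the library.

```python
from html.parser import HTMLParser


class StopParsing(Exception):
    """Hypothetical exception used only to leave feed() early."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value


class MetaDescriptionParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; values may be None
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            # Carry the result out of the parser and abort the rest of feed()
            raise StopParsing(attributes.get("content"))


def find_meta_description(html):
    """Return the meta description, or None if the document has none."""
    parser = MetaDescriptionParser()
    try:
        parser.feed(html)
    except StopParsing as stop:
        return stop.value
    return None
```

Calling find_meta_description(page_source) on a page's HTML string should return the content attribute of the first matching meta tag, without the parser walking the rest of the document.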

6 Comments

An exception doesn't sound like a nice idea here: exceptions should be used only for exceptional conditions, and in this case you propose using one as a control-flow tool. As for the 'reset' method, I've considered it too, but I can't figure out if it's really relevant here.
Re: "exceptions .. for exceptional conditions": not so true for Python. Did you know that StopIteration is raised whenever an iterator "runs out of" iterations? That's not much of an "exceptional condition", now is it? In fact, it is distinctly similar to the condition that the questioner wants to handle: a "break now" kind of condition.
@shylent: true about StopIteration, but it is rarely handled manually; it is usually wrapped so that the user almost never sees it directly. Nevertheless, you're making a good point.
I was looking for an answer to the same question and liked your solution. I tried self.reset() first, as it seemed to be a specialized solution, but it raised an exception with some funny messages like "we shouldn't be here". I am now raising a custom exception with the parsed data as an argument. An additional answer with an example is on the way.
Calling self.reset() from inside handle_starttag throws an exception with the text "we should not get here!" (Python 3.5). I think you had better heed that.

If you use pyparsing's scanString method, you have more control over how far you actually go through the input string. In your example, we create an expression that matches a <meta> tag, and add a parse action that ensures that we only match the tag with name="description". This code assumes that you have read the page's HTML into the variable htmlsrc:

from pyparsing import makeHTMLTags, withAttribute

# makeHTMLTags creates both open and closing tags, only care about the open tag
metaTag = makeHTMLTags("meta")[0]
metaTag.setParseAction(withAttribute(name="description"))

try:
    # scanString is a generator that returns each match as it is found
    # in the input; next() pulls only the first match
    tokens, startloc, endloc = next(metaTag.scanString(htmlsrc))

    # attributes can be accessed like object attributes if they are
    # valid Python names
    print(tokens.content)

    # if the attribute name clashes with a Python keyword, or is
    # otherwise unsuitable as an identifier, use dict-like access instead
    print(tokens["content"])

except StopIteration:
    print("no matching meta tag found")

1 Comment

Thanks for the answer. I'm sure this works as well and I appreciate having somewhat of an introduction to pyparsing. I would mark both correct if I could.

Extending on @shylent's answer, here's my solution:

from html.parser import HTMLParser  # Python 3 module name


class MyParser(HTMLParser):

    boolean_flag = False

    def handle_starttag(self, tag, attrs):
        # for example: remember whether the tag just opened is the one we want
        self.boolean_flag = (tag == "sometag" and ("id", "someid") in attrs)

    def handle_endtag(self, tag):
        pass

    def handle_data(self, data):
        if self.boolean_flag:
            # carry the data out of the parser and abort feed()
            raise DataParsedException(data)


class DataParsedException(Exception):
    def __init__(self, data):
        self.data = data

Usage:

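parser = MyParser()
results = []  # renamed from vars, which shadows a Python builtin
try:
    parser.feed(html.decode())  # html is assumed to hold the page as bytes
except DataParsedException as dataParsed:
    results.append(dataParsed.data)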

It does the job.
