
I am using Python's HTMLParser from the html.parser module. I am looking for a single tag, and once it is found it would make sense to stop parsing. Is this possible? I tried calling close(), but I am not sure this is the right approach.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        login_form = False
        if tag == "form":
            print("finished")
            self.close()  # attempt to stop parsing once the tag is found

However, this seems to recurse without end, finishing with:

  File "/usr/lib/python3.4/re.py", line 282, in _compile
    p, loc = _cache[type(pattern), pattern, flags]
RuntimeError: maximum recursion depth exceeded in comparison
  • It seems that you should call the close method of the parent class HTMLParser, but the interpreter can't resolve the reference to that method. I am curious to know why this doesn't work. Commented May 17, 2015 at 16:45
  • Possible duplicate of How to tell python HTMLParser to stop Commented Apr 8, 2018 at 5:39

1 Answer


According to the docs, the close() method does this:

Force processing of all buffered data as if it were followed by an end-of-file mark.

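You're still inside handle_starttag and the parser hasn't finished working through its buffer yet, so you definitely do not want to force processing of all the buffered data from there. Calling close() at that point starts parsing the buffered data again, which reaches the same tag and calls your handler again, and that's why you get stuck in recursion. You can't stop the machine from inside the machine.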
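A minimal reproduction of that recursion may make it clearer. It is only a sketch and assumes CPython's html.parser behaves as described above; ClosingParser and reproduce are names made up for this illustration.

```python
from html.parser import HTMLParser


class ClosingParser(HTMLParser):
    """Illustrative parser that calls close() from inside a handler."""

    def __init__(self):
        super().__init__()
        self.calls = 0  # how many times the handler saw the same <form> tag

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.calls += 1
            # The buffered input has not been consumed yet, so this parses
            # it again and ends up back in this handler.
            self.close()


def reproduce(html="<form>"):
    """Return how often the handler ran before the recursion limit was hit."""
    parser = ClosingParser()
    try:
        parser.feed(html)
    except RuntimeError:
        # RecursionError (Python 3.5+) is a subclass of RuntimeError, which is
        # what Python 3.4 raised in the traceback from the question.
        return parser.calls
    return 0
```

The handler runs many times for a single tag in the input, which is the "machine inside the machine" problem.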

From the description of reset() this sounds more like what you want:

Reset the instance. Loses all unprocessed data.

but reset() also can't be called from the handlers that the parser itself invokes, so it runs into the same kind of problem.

It sounds like you have two options:

  • raise an exception from the handler (for example StopIteration, or a small exception class of your own) and catch it around your call to the parser. Anything you stored on the parser before raising is still available afterwards. You may need to check that files aren't left open.

  • use a simple flag (True / False) to signify whether you have aborted or not. At the very start of handle_starttag, return immediately if the flag is set. The machinery will still go through all the tags of the HTML, but do nothing for each one. If you also implement handle_endtag or other handlers, they need to check the flag too. You can reset the flag either when you receive an <html> tag or by overriding the feed method.
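Both options can be sketched roughly as follows. This is not a definitive implementation: StopParsing, FirstFormParser, FlagFormParser, find_first_form and the form_attrs, done and seen attributes are all names invented for this example, and resetting the flag in feed() assumes each feed() call receives one whole document.

```python
from html.parser import HTMLParser


class StopParsing(Exception):
    """Hypothetical exception, used only to unwind out of feed()."""


class FirstFormParser(HTMLParser):
    """Option 1: abort by raising from the handler."""

    def __init__(self):
        super().__init__()
        self.form_attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_attrs = dict(attrs)  # keep what we need before leaving
            raise StopParsing


def find_first_form(html):
    """Return the attributes of the first <form>, or None if there is none."""
    parser = FirstFormParser()
    try:
        parser.feed(html)
    except StopParsing:
        pass  # expected: the handler found the tag and stopped the parse
    return parser.form_attrs


class FlagFormParser(HTMLParser):
    """Option 2: keep parsing, but ignore everything after the first match."""

    def __init__(self):
        super().__init__()
        self.done = False
        self.form_attrs = None
        self.seen = []  # tags handled before the flag was set

    def feed(self, data):
        # Assumption: one feed() call per document, so clear the flag here.
        self.done = False
        super().feed(data)

    def handle_starttag(self, tag, attrs):
        if self.done:
            return  # aborted: do nothing for the remaining tags
        self.seen.append(tag)
        if tag == "form":
            self.form_attrs = dict(attrs)
            self.done = True
```

With the exception approach the rest of the document is never parsed; with the flag approach it is still scanned, but the handlers return immediately. If you want to reuse a parser after an exception, calling reset() on it first (from outside the handlers) is probably wise, since the unprocessed input should still be buffered.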

1 Comment

can you explain the solution with a rough code snippet? @Constance
