Python HTMLParser - stop parsing

Question

I am using Python's HTMLParser from the html.parser module. I am looking for a single tag, and once it is found it would make sense to stop parsing. Is this possible? I tried calling close(), but I am not sure if this is the way to go.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            print("finished")
            self.close()  # attempt to stop parsing from inside the handler

However, this seems to recurse endlessly, ending with:

  File "/usr/lib/python3.4/re.py", line 282, in _compile
    p, loc = _cache[type(pattern), pattern, flags]
RuntimeError: maximum recursion depth exceeded in comparison

It seems that you should call the close method of the parent class HTMLParser, but the interpreter cannot resolve the reference to that method. I am curious to know why this doesn't work.
– user4745703, Commented May 17, 2015 at 16:45

Constance · Accepted Answer · 2018-03-20 13:35:04Z

According to the docs, the close() method does this:

Force processing of all buffered data as if it were followed by an end-of-file mark.

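You're still inside handle_starttag at that point, and the parser hasn't finished working through its buffer, so you definitely do not want to force processing of all the buffered data from there. close() re-enters the parsing loop on a buffer that still contains the tag being handled, so handle_starttag is called again for the same tag, which calls close() again, and so on. That's why you end up in unbounded recursion. You can't stop the machine from inside the machine.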

From the description of reset() this sounds more like what you want:

Reset the instance. Loses all unprocessed data.

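However, reset() cannot be relied on from inside a handler either: the parsing loop that invoked the handler is still running at that point, so calling it there does not give you a clean stop.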

It sounds like you have two options:

1. Raise an exception (for example StopIteration, or a small exception class of your own) and catch it around your call to the parser. Depending on what else you do during parsing, this may retain the information you need. You may also need to check that files aren't left open.
2. Use a simple flag (True / False) to record whether you have aborted. At the very start of handle_starttag, return immediately if the flag is set. The machinery will still go through every tag in the HTML, but do nothing for each one. If you also implement handle_endtag, it needs to check the flag too. You can reset the flag either when you receive an <html> tag or by overriding the feed method.
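
A rough sketch of both options follows. The names StopParsing, RaisingFormFinder, FlagFormFinder, find_form_raising and find_form_flag are made up for illustration and are not part of html.parser. The sketch uses a small custom exception in place of StopIteration, and it assumes the whole document is passed in a single feed() call.

```python
from html.parser import HTMLParser


class StopParsing(Exception):
    """Hypothetical signal raised by a handler to abandon parsing early."""


class RaisingFormFinder(HTMLParser):
    """Option 1: raise from the handler once the first <form> is seen."""

    def __init__(self):
        super().__init__()
        self.form_attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_attrs = dict(attrs)
            raise StopParsing  # unwinds out of feed() to the caller


class FlagFormFinder(HTMLParser):
    """Option 2: keep parsing, but ignore every tag after the first <form>."""

    def __init__(self):
        super().__init__()
        self.done = False
        self.form_attrs = None

    def feed(self, data):
        # Assumption: one feed() call per document, so the flag is cleared here.
        self.done = False
        super().feed(data)

    def handle_starttag(self, tag, attrs):
        if self.done:
            return  # already found what we wanted; do nothing
        if tag == "form":
            self.form_attrs = dict(attrs)
            self.done = True


def find_form_raising(html):
    parser = RaisingFormFinder()
    try:
        parser.feed(html)
        parser.close()  # safe here: we are outside any handler
    except StopParsing:
        pass  # the parser stopped early; form_attrs is already set
    return parser.form_attrs


def find_form_flag(html):
    parser = FlagFormFinder()
    parser.feed(html)
    parser.close()
    return parser.form_attrs
```

With the exception approach the parser is abandoned mid-document, so it is probably safest to discard it or call reset() (from outside the handlers) before reusing it. The flag approach still walks the whole input, which costs time on large documents but leaves the parser in a normal state.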

1 Comment

Sidrah Madiha Siddiqui Over a year ago

Can you explain the solution with a rough code snippet? @Constance
