I'm trying to write an HTML parser by subclassing Python's html.parser.HTMLParser class and have some questions.
I'm defining a parsing class as follows:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        self.messages = []

    def handle_starttag(self, tag, attrs):
        message = "Encountered a start tag: %s" % tag
        print(message)
        self.messages.append(message)

    def handle_endtag(self, tag):
        message = "Encountered an end tag: %s" % tag
        print(message)
        self.messages.append(message)

    def handle_data(self, data):
        message = "Encountered some data: %s" % data
        print(message)
        self.messages.append(message)

html_parser = MyHTMLParser()
html_parser.feed("<html><head><title>Test</title></head>")
print(html_parser.messages)
I want to store the results of the handle_data() method, but I cannot get handle_data() to return anything other than None. When I instead try to store the results of the handle_* methods in the self.messages attribute, I get the following error:
Traceback (most recent call last):
  File "./parse_html.py", line 33, in <module>
    html_parser.feed("Test")
  File "/opt/local/depot/python/3.6.4/lib/python3.6/html/parser.py", line 110, in feed
    self.rawdata = self.rawdata + data
AttributeError: 'MyHTMLParser' object has no attribute 'rawdata'
I could always make "messages" a global variable, but I'm looking for another way of storing the results of the handle_* methods. What's the recommended way of retrieving the list of all the elements found by the handle_data() calls?
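To make the goal concrete, here is a minimal sketch of the behaviour I'm after. It rests on one assumption: that the AttributeError appears because a custom __init__ replaces HTMLParser.__init__, which is what normally sets up rawdata, so the sketch calls super().__init__() before creating its own attributes. The class name CollectingParser and the attribute name data are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class CollectingParser(HTMLParser):
    """Hypothetical subclass that keeps handler results on the instance."""

    def __init__(self):
        # Assumption: the base initializer creates parser state such as rawdata.
        super().__init__()
        self.messages = []  # one formatted message per event
        self.data = []      # raw text passed to handle_data()

    def handle_starttag(self, tag, attrs):
        self.messages.append("Encountered a start tag: %s" % tag)

    def handle_endtag(self, tag):
        self.messages.append("Encountered an end tag: %s" % tag)

    def handle_data(self, data):
        self.messages.append("Encountered some data: %s" % data)
        self.data.append(data)


collector = CollectingParser()
collector.feed("<html><head><title>Test</title></head>")
collector.close()  # flush any text still buffered by the parser
print(collector.data)
print(collector.messages)
```

The handlers' return values are never used here; everything is read back from the instance attributes after feed() and close(), so no global variable is needed.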
Thank you for any hints,
Catherine