Consider the following HTML input:
<html>
<head>
<script>
function open_tools(tool_div)
{
document.getElementById("tool1").innerHTML = "<a href='javascript:void(0);' onclick=\"javascript:clos_tools('""');\"><img src='menu.gif' border='0' /></a>";
document.getElementById("tool").innerHTML = "<a href='javascript:void(0);' onclick=\"javascript:open_tools('""');\"><img src='plus.gif' border='0' /></a>";
}
</script>
</head>
<body />
</html>
For quick testing, save this HTML as 'test.html'. Then, in the Python shell:
>>> f = open('test.html', 'r')
>>> data = f.read()
>>> from HTMLParser import HTMLParser
>>> p = HTMLParser()
>>> p.feed(data)
Burrrppp... it fails with the following error:
  File "lib\HTMLParser.py", line 155, in goahead
    k = self.parse_starttag(i)
  File "lib\HTMLParser.py", line 235, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "lib\HTMLParser.py", line 319, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 7, column 88
I have been confused by this error for the past 6 hours. This is what I found inside the HTMLParser.py code:
While parsing, when it encounters a script tag, it sets cdata = true.
After that, it uses the regular expression interesting_cdata = re.compile(r'<(/|\Z)') to find the end of the script tag [inside goahead()].
Unfortunately, it seems to find the end of the script tag at the </a> in the first statement of the function open_tools instead of at </script>, and then it burps on the second line of the function.
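To illustrate the suspicion above, here is a minimal sketch that runs the quoted pattern on its own, outside HTMLParser. The script_body string is a shortened, made-up stand-in for the script in test.html, and strict_cdata is a hypothetical stricter pattern that I am assuming for comparison; it is not taken from the library.

```python
import re

# Pattern quoted above from HTMLParser.py (Python 2.6).
interesting_cdata = re.compile(r'<(/|\Z)')

# Shortened stand-in for the script body in test.html (made up for illustration).
script_body = (
    'document.getElementById("tool1").innerHTML = '
    '"<a href=\'#\'><img src=\'menu.gif\' /></a>";\n'
    '</script>'
)

# The pattern accepts any "</", so the first hit is the "</a>" inside the string.
loose = interesting_cdata.search(script_body)
loose_hit = script_body[loose.start():loose.start() + 4]

# Hypothetical stricter pattern (my assumption, not taken from the library):
# only the matching closing tag ends the script block.
strict_cdata = re.compile(r'</script\s*>', re.IGNORECASE)
strict = strict_cdata.search(script_body)
strict_hit = strict.group(0)

print(loose_hit)
print(strict_hit)
```

If this sketch is representative, the loose pattern stops at the closing anchor tag inside the JavaScript string, while a pattern tied to the element name would skip past it to the real end of the script block.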
I do not know how to fix this, and the thought of a bug in HTMLParser is disturbing. Can anyone help?
Note: I am a Python amateur and tested the above with Python 2.6 (Windows).
Edit: Yes, it works with BeautifulSoup. But I am interested in knowing whether the regex is broken (and if so, how, and what the fix is) or whether there is some other problem with the HTMLParser class. Getting stuck at the first step with library code is discouraging. A good thing about the PHP docs is the ability to comment on the official docs page; the same was supported on MSDN as well.
HTMLParser in Python 2.7.2.