Python class HTMLParser incorrectly giving parse error

Question

Consider the following html input:

<html>
<head>
<script>
function open_tools(tool_div)
{
  document.getElementById("tool1").innerHTML = "<a href='javascript:void(0);' onclick=\"javascript:clos_tools('""');\"><img src='menu.gif' border='0' /></a>";
  document.getElementById("tool").innerHTML  = "<a href='javascript:void(0);' onclick=\"javascript:open_tools('""');\"><img src='plus.gif' border='0' /></a>";
}
</script>
</head>
<body /> 
</html>

For quick testing, assume you dump this HTML into 'test.html'. Then, in the Python shell:

>>> f = open('test.html', 'r')
>>> data = f.read()
>>> from HTMLParser import HTMLParser
>>> p = HTMLParser()
>>> p.feed(data)

It fails with the following error:

  File "lib\HTMLParser.py", line 155, in goahead
    k = self.parse_starttag(i)
  File "lib\HTMLParser.py", line 235, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "lib\HTMLParser.py", line 319, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 7, column 88

I have been confused by this error for the past 6 hours. This is what I found in the HTMLParser.py code:

While parsing, when it encounters a script tag, it sets cdata = true. After that, inside goahead(), it uses the regular expression interesting_cdata = re.compile(r'<(/|\Z)') to find the end of the script tag.

Unfortunately, it seems to find the end of the script tag at the </a> in the first statement of the function open_tools instead of at </script>, and then it fails on the second line of the function.
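The behaviour described above can be reproduced in isolation. The following is a minimal sketch that applies the regular expression quoted from HTMLParser.py to a shortened script body; the `script_body` string is a made-up stand-in for the page, not the original input.

```python
import re

# The pattern quoted above from HTMLParser.py: in CDATA mode the search
# stops at the first "</" (or at a lone "<" at the very end of the input).
interesting_cdata = re.compile(r'<(/|\Z)')

# Hypothetical, shortened stand-in for the text that follows <script>.
script_body = (
    'x.innerHTML = "<a href=\'#\'><img src=\'menu.gif\' /></a>";\n'
    '</script>'
)

match = interesting_cdata.search(script_body)
stop = match.start()

# The text at the stop position is the "</a>" inside the JS string,
# not the real "</script>" end tag further on.
print(repr(script_body[stop:stop + 4]))             # '</a>'
print(stop < script_body.index('</script>'))        # True
```

The opening `<a` and `<img` are skipped because the pattern only matches a `<` that is followed by `/` or by the end of the input, so the first hit is the `</a>` inside the string literal.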

I do not know how to fix this, and the thought of a bug in HTMLParser is disturbing. Can anyone help?

Note: I am a Python amateur and tested the above with Python 2.6 (Windows).

Edit: Yes, it works with BeautifulSoup. But I am interested in knowing whether the regex is broken (and if so, how, and what the fix is) or whether there is some other problem with the HTMLParser class. Getting stuck at the first step with library code is discouraging. A good thing about the PHP docs is the ability to comment on the official docs page; MSDN supported the same.

Not sure about that error - but have you tried using BeautifulSoup?
– BeRecursive, Commented Dec 24, 2011 at 21:13
I can parse that snippet perfectly with an HTMLParser in Python 2.7.2.
– Fred Foo, Commented Dec 24, 2011 at 21:23
@larsmans: I've tried it on 2.5-2.7 and 3.0-3.3. It works only on 3.3 (cpython-e0df57330b83); e.g., for 3.1 see ideone.com/x4qB3
– jfs, Commented Dec 25, 2011 at 4:46
@J.F.Sebastian Providing the ideone link is nice. I will look at the HTMLParser source of 3.3. Thanks.
– vivek.m, Commented Dec 25, 2011 at 9:34

bobince · Accepted Answer · 2011-12-25 17:46:23Z

4

it seems it is finding end of script tag in </a> of first statement

Yes, and it is correct to do so according to the HTML4 standard.

In HTML<5 (and SGML from which this behaviour is inherited), a CDATA-element like <script> or <style> is ended by the </ (ETAGO) sequence. It is an error for that sequence not to be part of a matching end-tag.

So to validate as HTML4 one must ensure no </ sequences are contained in script blocks. (The easiest way of doing that if it's your own code is to write them as JS string literal escapes like <\/ or \x3C/. But if it's your own code you'll want to look at using DOM methods instead, to avoid all the escaping problems.)
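For a downloaded page whose JavaScript cannot be edited by hand, the same escaping could be applied to the text before it is fed to the parser. This is a minimal sketch under loud assumptions: `escape_etago_in_scripts` is a hypothetical helper name, and the regular expression only approximates where script blocks start and end (it is not an HTML tokenizer and can be fooled, for example, by a literal `</script>` inside a JS string).

```python
import re

# Matches an opening <script ...> tag, its body, and the closing </script>.
# A rough approximation for illustration, not a real HTML tokenizer.
SCRIPT_BLOCK = re.compile(r'(<script\b[^>]*>)(.*?)(</script\s*>)',
                          re.IGNORECASE | re.DOTALL)


def escape_etago_in_scripts(html):
    """Rewrite "</" as "<\\/" inside script bodies (hypothetical helper).

    Inside a JS string literal "<\\/" means the same as "</", so the script
    should behave the same as long as every "</" sits in a string, while an
    HTML4-style parser no longer sees an end-tag opener in the body.
    """
    def fix(match):
        open_tag, body, close_tag = match.groups()
        return open_tag + body.replace('</', '<\\/') + close_tag

    return SCRIPT_BLOCK.sub(fix, html)


page = '<script>el.innerHTML = "<a href=\'#\'>x</a>";</script>'
print(escape_etago_in_scripts(page))
```

Only the script body is rewritten; the real `</script>` end tag and any markup outside script blocks are left untouched.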

In HTML5 this is changed so that only the matching end-tag ends a CDATA block. This more closely matches traditional browser behaviour. If you use an HTML5 parser such as html5lib's you'll be OK.
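As a rough illustration of the "only the matching end-tag" rule, the sketch below uses the standard-library parser of a recent Python 3 release (`html.parser`, the Python 3 name of the `HTMLParser` module), which according to the comments on the question copes with this input from 3.3 onward. It is not html5lib; `ScriptCollector` and the shortened `page` string are made up for this example.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class ScriptCollector(HTMLParser):
    """Hypothetical subclass that records the text found inside <script>."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False

    def handle_data(self, data):
        # Script content arrives here as plain character data.
        if self.in_script:
            self.script_text.append(data)


# Shortened stand-in for the page in the question (an assumption).
page = ('<html><head><script>'
        'el.innerHTML = "<a href=\'#\'><img src=\'plus.gif\' /></a>";'
        '</script></head><body></body></html>')

collector = ScriptCollector()
collector.feed(page)
collector.close()

body = ''.join(collector.script_text)
print('</a>' in body)
```

On such an interpreter the `</a>` should end up inside the collected script text rather than being treated as the end of the script element.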

answered Dec 25, 2011 at 17:46


1 Comment

vivek.m Over a year ago

D'oh, HTML4!! I download the page and feed the HTML data to the parser, so I can make the suggested replacement in the JS. But I have now moved to BeautifulSoup, and it behaves correctly. I wish your comment could be mentioned in the official Python docs for HTMLParser.

ekhumoro · Accepted Answer · 2011-12-25 15:31:39Z

2

The title of the HTMLParser module docs says it all:

HTMLParser — Simple HTML and XHTML parser

where "simple" really does mean simple.

If you want to do any serious html parsing, use BeautifulSoup or lxml.

EDIT

To answer the specific question regarding the error:

It appears to be related to the bug reported in issue 13358, a fix for which should be included in the next release of Python 2.7 and 3.2.

(I still stand by my statements above, though ;-)

edited Dec 25, 2011 at 15:31

answered Dec 24, 2011 at 21:18


2 Comments

Francis Avila Over a year ago

html5lib also. It can interoperate with lxml.

vivek.m Over a year ago

@ekhumoro :D I feel my HTML snippet is also simple enough. I would be most happy to know why that regex is broken (and also whether it is really broken, or whether there is some other bug, e.g. in the HTML code).
