Consider the following HTML input:

<html>
<head>
<script>
function open_tools(tool_div)
{
  document.getElementById("tool1").innerHTML = "<a href='javascript:void(0);' onclick=\"javascript:clos_tools('""');\"><img src='menu.gif' border='0' /></a>";
  document.getElementById("tool").innerHTML  = "<a href='javascript:void(0);' onclick=\"javascript:open_tools('""');\"><img src='plus.gif' border='0' /></a>";
}
</script>
</head>
<body /> 
</html>

For quick testing, save this HTML as 'test.html'. Then, in the Python shell:

>>> f = open('test.html', 'r')
>>> data = f.read()
>>> from HTMLParser import HTMLParser
>>> p = HTMLParser()
>>> p.feed(data)

Burrrppp... it fails with the following error:

  File "lib\HTMLParser.py", line 155, in goahead
    k = self.parse_starttag(i)
  File "lib\HTMLParser.py", line 235, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "lib\HTMLParser.py", line 319, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 7, column 88

I was confused by this error for the past 6 hours. This is what I found in the HTMLParser.py code:

While parsing, when it encounters a script tag, it switches to CDATA mode. After that, it uses the regular expression interesting_cdata = re.compile(r'<(/|\Z)') to find the end of the script element [inside goahead()].

Unfortunately, it seems it is finding the end of the script tag in the </a> of the first statement of function open_tools instead of at </script>. And then it burps on the second line of the function.
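
A minimal sketch of that observation, using only the regular expression quoted above. The script body here is a shortened, hypothetical stand-in for the real page, not the exact input:

```python
import re

# The pattern quoted above from Python 2.6's HTMLParser.py.
interesting_cdata = re.compile(r'<(/|\Z)')

# Hypothetical, shortened script body followed by the real end tag.
script_body = 'el.innerHTML = "<a href=\'#\'>menu</a>";\n</script>'

match = interesting_cdata.search(script_body)

# Text found at the first match: the "</" of "</a>", not of "</script>".
first_hit = script_body[match.start():match.start() + 4]
print(first_hit)
print(match.start() < script_body.index('</script>'))
```

The first match lands inside the string literal, before </script>, which is consistent with the parser leaving CDATA mode too early.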

I do not know how to fix this, and the thought of a bug in HTMLParser is disturbing. Can anyone help?

Note: I am a Python amateur and tested the above with Python 2.6 (Windows).

Edit: Yes, it works with BeautifulSoup. But I am interested in knowing whether the regex is broken (and if so, how, and what the fix is) or whether there is some other problem with the HTMLParser class. Getting stuck at the first step with library code is discouraging. A good thing about the PHP docs is the ability to comment on the official docs page; the same was supported on MSDN as well.

  • Not sure about that error, but have you tried using BeautifulSoup? Commented Dec 24, 2011 at 21:13
  • I can parse that snippet perfectly with an HTMLParser in Python 2.7.2. Commented Dec 24, 2011 at 21:23
  • @larsmans: I've tried it on 2.5-2.7 and 3.0-3.3. It works only on 3.3 (cpython-e0df57330b83); e.g., 3.1: ideone.com/x4qB3 Commented Dec 25, 2011 at 4:46
  • @J.F.Sebastian Providing the ideone link is nice. I will look at the HTMLParser source in 3.3. Thanks. Commented Dec 25, 2011 at 9:34

2 Answers

it seems it is finding end of script tag in </a> of first statement

Yes, and it is correct to do so according to the HTML4 standard.

In HTML<5 (and SGML from which this behaviour is inherited), a CDATA-element like <script> or <style> is ended by the </ (ETAGO) sequence. It is an error for that sequence not to be part of a matching end-tag.

So to validate as HTML4 one must ensure no </ sequences are contained in script blocks. (The easiest way of doing that if it's your own code is to write them as JS string literal escapes like <\/ or \x3C/. But if it's your own code you'll want to look at using DOM methods instead, to avoid all the escaping problems.)
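
If the page is not your own (for example, one that is downloaded and then fed to the parser), the same escape could in principle be applied before parsing. A rough sketch, assuming script bodies can be located with a simple regular expression and that every </ inside them sits in a JS string literal; escape_etago_in_scripts is a hypothetical helper, not part of any library:

```python
import re

# Matches a <script ...> start tag, its body, and the closing </script>.
# Assumption: scripts are not nested and the end tag is spelled normally.
SCRIPT_RE = re.compile(r'(<script\b[^>]*>)(.*?)(</script\s*>)', re.I | re.S)


def escape_etago_in_scripts(html):
    # Hypothetical helper: rewrite "</" as "<\/" inside script bodies only.
    def fix(m):
        body = m.group(2).replace('</', '<\\/')
        return m.group(1) + body + m.group(3)
    return SCRIPT_RE.sub(fix, html)


# Shortened stand-in for the page in the question.
page = '<script>el.innerHTML = "<a href=\'#\'>menu</a>";</script>'
cleaned = escape_etago_in_scripts(page)
print(cleaned)
```

Inside a JS string literal <\/ means the same as </, so the script should behave as before, but a blind text replacement like this could alter code where </ appears outside a string.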

In HTML5 this is changed so that only the matching end-tag ends a CDATA block. This more closely matches traditional browser behaviour. If you use an HTML5 parser such as html5lib's you'll be OK.
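
As a hedged illustration using only the standard library: per the comments on the question, the parser shipped with Python 3.3 accepts this kind of input, and in Python 3 the module is named html.parser. The sketch below assumes a current Python 3; TagCollector and the shortened page are made up for the example:

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and the raw script text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == 'script':
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        # Only keep text that appears between <script> and </script>.
        if self._in_script:
            self.script_text.append(data)


# Shortened stand-in for the page in the question.
page = ('<html><head><script>'
        'el.innerHTML = "<a href=\'#\'>menu</a>";'
        '</script></head><body></body></html>')

collector = TagCollector()
collector.feed(page)
collector.close()

print(collector.tags)
print(''.join(collector.script_text))
```

If the parser follows the HTML5 rule, the <a> inside the string literal is never reported as a tag and the </a> stays part of the script text.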

1 Comment

D'oh, HTML4!! I download the page and feed the HTML data to the parser, so I can make the suggested replacement for the JS. But I have now moved to BeautifulSoup and it is behaving correctly. I wish your comment could be mentioned in the official Python docs for HTMLParser.

The title of the HTMLParser module docs says it all:

HTMLParser — Simple HTML and XHTML parser

where "simple" really does mean simple.

If you want to do any serious html parsing, use BeautifulSoup or lxml.

EDIT

To answer the specific question regarding the error:

It appears to be related to the bug reported in issue 13358, a fix for which should be included in the next release of Python 2.7 and 3.2.

(I still stand by my statements above, though ;-)

2 Comments

html5lib also. It can interoperate with lxml.
@ekhumoro :D I feel my HTML snippet is also simple enough. I would be most happy to know why that regex is broken (and also whether it is really broken, or whether there is some other bug, e.g. in the HTML code).
