Parse a broken HTML page in Python

Question

I am trying to parse a broken HTML page that has a comment nested inside another comment. The well-known HTML parsers, such as BeautifulSoup, lxml, and HTMLParser, all give syntax errors. The page is shown below. How do I ignore the corrupt part and parse the rest of the page?

<html xmlns="http://www.w3.org/1999/xhtml"><head>

<script language="JavaScript">
<!--
     function setTimeOffsetVars (Link) { 
   // code removed
 } 

<!-- Image Preloader - takes an array of images to preload --> 
    function warningCheck(e, warnMsg) {
   // code removed
}
-->
</script>

</head>

<body topmargin="0" leftmargin="0" rightmargin="0" bottommargin="0" marginwidth="0" marginheight="0">
<!-- lot of useful code -->
</body></html>

Amadan · Accepted Answer · 2012-12-26 08:24:05Z

3

If you know what the problem is, you can preprocess: first use a primitive method like regexps to strip the offending inner comment, then hit it with a real parser.
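A minimal sketch of that preprocessing step, assuming the offending inner comment always sits on a single line inside a script block, as it does in the question's page. The names `strip_inner_comments` and `CommentCollector` and the shortened `sample` string are made up for illustration, and the standard library's `html.parser` stands in for whichever real parser you prefer.

```python
import re
from html.parser import HTMLParser

# Capture each <script>...</script> block so only script bodies are touched.
SCRIPT_RE = re.compile(r"(<script\b[^>]*>)(.*?)(</script>)",
                       re.DOTALL | re.IGNORECASE)

# Without re.DOTALL, "." stops at newlines, so this matches only a comment
# that opens and closes on the same line (assumption: the inner comment does).
INLINE_COMMENT_RE = re.compile(r"<!--.*?-->")


def strip_inner_comments(html):
    """Remove single-line comments found inside script blocks."""
    def clean(match):
        opening, body, closing = match.groups()
        return opening + INLINE_COMMENT_RE.sub("", body) + closing
    return SCRIPT_RE.sub(clean, html)


class CommentCollector(HTMLParser):
    """Stand-in for the 'real parser': records comments outside scripts."""
    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data.strip())


# Shortened version of the page from the question.
sample = """<html><head>
<script language="JavaScript">
<!--
function setTimeOffsetVars(Link) {}
<!-- Image Preloader - takes an array of images to preload -->
function warningCheck(e, warnMsg) {}
-->
</script>
</head>
<body>
<!-- lot of useful code -->
</body></html>"""

cleaned = strip_inner_comments(sample)
collector = CommentCollector()
collector.feed(cleaned)
collector.close()
print(collector.comments)
```

Restricting the substitution to script blocks leaves ordinary comments in the body untouched. If the nested comment can span several lines, the pattern would need to be adapted to that page.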


sneawo · Accepted Answer · 2012-12-26 09:00:59Z

2

I get no errors with this HTML. I tried both beautifulsoup4 and lxml.

from bs4 import BeautifulSoup

# s is a string holding the HTML from the question
soup = BeautifulSoup(s)
print(soup.prettify())


<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <script language="JavaScript">
   &lt;!--
     function setTimeOffsetVars (Link) { 
   // code removed
 } 

&lt;!-- Image Preloader - takes an array of images to preload --&gt; 
    function warningCheck(e, warnMsg) {
   // code removed
}
--&gt;
  </script>
 </head>
 <body bottommargin="0" leftmargin="0" marginheight="0" marginwidth="0" rightmargin="0" topmargin="0">
  <!-- lot of useful code -->
 </body>
</html>
