I am trying to parse a broken HTML page that has a comment nested inside another comment, and the well-known HTML parsers (BeautifulSoup, lxml and HTMLParser) all raise syntax errors on it. The page is shown below. How do I ignore the corrupt part and parse the rest of the page?
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<script language="JavaScript">
<!--
function setTimeOffsetVars (Link) {
// code removed
}
<!-- Image Preloader - takes an array of images to preload -->
function warningCheck(e, warnMsg) {
// code removed
}
-->
</script>
</head>
<body topmargin="0" leftmargin="0" rightmargin="0" bottommargin="0" marginwidth="0" marginheight="0">
<!-- lot of useful code -->
</body></html>
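
One possible workaround, sketched here under the assumption that the contents of the `<script>` element are not needed: remove every `<script>...</script>` block with a regular expression before handing the page to a parser, so the nested comment never reaches it. The names `strip_scripts` and `CommentCollector` are hypothetical helpers made up for this illustration, and the regex is a rough pre-filter, not a general HTML parser. It will fail if the literal text `</script>` appears inside a script string.

```python
import re
from html.parser import HTMLParser

# Matches an opening <script ...> tag, everything up to the first closing
# </script>, and the closing tag itself (case-insensitive, across newlines).
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html: str) -> str:
    """Return the page with all <script> elements removed (hypothetical helper)."""
    return SCRIPT_RE.sub("", html)


class CommentCollector(HTMLParser):
    """Tiny parser used only to show that the cleaned page parses."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data.strip())


page = """<html xmlns="http://www.w3.org/1999/xhtml"><head>
<script language="JavaScript">
<!--
function setTimeOffsetVars (Link) {
// code removed
}
<!-- Image Preloader - takes an array of images to preload -->
function warningCheck(e, warnMsg) {
// code removed
}
-->
</script>
</head>
<body topmargin="0" leftmargin="0">
<!-- lot of useful code -->
</body></html>"""

cleaned = strip_scripts(page)

parser = CommentCollector()
parser.feed(cleaned)
parser.close()
print(parser.comments)
```

The same `cleaned` string could then be passed to BeautifulSoup or lxml instead of the standard-library parser. Whether that avoids the original errors depends on the library versions in use.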