I am writing a program whose first step takes a URL and opens the page. It then passes the content to the xml.dom.minidom parser:
import urllib2
from xml.dom.minidom import parse
page = urllib2.urlopen(page_url)  # file-like response object
parser = parse(page)  # parse() accepts a filename or a file-like object
The problem is that many pages have mismatched tags and special characters, so the parse method raises an error. It also raises an error for any <br> that is not written as <br />.
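A minimal sketch that reproduces the <br> failure without fetching a page. The two markup strings and the helper name try_parse are made up for illustration and are not part of my real program:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def try_parse(markup):
    """Return None if markup parses as XML, otherwise the parser's error text.

    Hypothetical helper, used only to show the behaviour described above.
    """
    try:
        parseString(markup)
        return None
    except ExpatError as exc:
        # minidom delegates to expat, which rejects anything that is not
        # well-formed XML, including an unclosed <br>.
        return str(exc)


well_formed = "<p>line one<br />line two</p>"  # XML-style self-closing tag
html_style = "<p>line one<br>line two</p>"     # ordinary HTML, unclosed <br>

print(try_parse(well_formed))
print(try_parse(html_style))
```

The first string parses, while the second makes the parser raise, which is the behaviour I see on real pages.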
I tried working around it like this:
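import urllib2
from xml.dom.minidom import parseString
page = urllib2.urlopen(page_url)
data = ""
for line in page.readlines():
    # patch each known problem by hand; the remaining replacements are elided
    data += str(line.replace("<br>", "<br />").replace(OTHER).replace...)
parser = parseString(data)  # parseString() takes a string; parse() expects a file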
But this is just not a good solution, since every new kind of error needs another replacement.
So, is there any library that is less sensitive to mismatched tags and other errors in HTML code?