I am writing a program whose first step takes a URL and opens the page. It then passes the content to the xml.dom.minidom parser:

import urllib2
from xml.dom.minidom import parse

page = urllib2.urlopen(page_url)
parser = parse(page)

The problem is that a lot of pages have mismatched tags and special characters, so the parse method raises an error. It also raises an error if the page contains <br> instead of <br />.
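A minimal reproduction of the failure, as a sketch that uses inline strings rather than a fetched page. The helper name is_well_formed is made up for this illustration:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(markup):
    """Return True if minidom can parse the markup, False if expat rejects it."""
    try:
        parseString(markup)
        return True
    except ExpatError:
        return False


# Self-closing <br /> is valid XML, so minidom accepts it.
print(is_well_formed("<p>one<br />two</p>"))
# A bare <br> is never closed, so the strict XML parser rejects it.
print(is_well_formed("<p>one<br>two</p>"))
```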

I tried like this:

import urllib2
from xml.dom.minidom import parseString

page = urllib2.urlopen(page_url)
data = ""
for line in page.readlines():
    data += str(line.replace("<br>", "<br />").replace(OTHER).replace...)
parser = parseString(data)

But this is just not a good solution.

So, is there a library that is less sensitive to mismatched tags and other errors in HTML code?

1 Answer

I prefer lxml.html: it is very robust with malformed markup, and lxml in general is quite fast and has very nice capabilities, including XPath support.

import lxml.html

# parse() accepts a filename, URL, or file-like object and returns an ElementTree
doc = lxml.html.parse('http://example.com')
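To show the tolerance without fetching anything, here is a small sketch that feeds deliberately broken markup to lxml.html.fromstring and then queries it with XPath. The sample string and the variable names are invented for the illustration, and it assumes an lxml version that ships the lxml.html package:

```python
import lxml.html

# Deliberately broken HTML: a bare <br>, a stray </b>, and no closing </p> or </html>.
broken = "<html><body><p>first<br>second <a href='/next'>next</b></a></body>"

# fromstring() repairs the markup into a tree instead of raising an error.
doc = lxml.html.fromstring(broken)

# XPath works on the repaired tree; this collects every link target.
links = doc.xpath("//a/@href")
print(links)
```

If the HTML is already in memory, as in the question's loop, passing the string to fromstring avoids the chain of replace calls entirely.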
2 Comments

Nope, not working. Oh yeah, and lxml doesn't have an .html package (it's only lxml.parse).
Maybe you are using an older version? Because it certainly does.
