Python xml.dom parsing problems

Question

I am writing a program whose first step takes a URL and opens the page. It then passes the content to the xml.dom.minidom parser:

import urllib2
from xml.dom.minidom import parse

page = urllib2.urlopen(page_url)
parser = parse(page)

The problem is that many pages have mismatched tags and special characters, so the parse method raises an error. It also raises an error if a page contains <br> instead of <br />.
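
To see the failure in isolation, here is a minimal sketch that uses only the standard library. The helper name is_well_formed and the two sample strings are made up for illustration; they are not part of the original program.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(markup):
    """Return True if minidom can parse the markup, False otherwise."""
    try:
        parseString(markup)
        return True
    except ExpatError:
        # minidom is a strict XML parser, so an unclosed tag is a hard error.
        return False


# Hypothetical sample snippets, not taken from any real page.
html_style = "<p>line one<br>line two</p>"    # unclosed <br>, fine in HTML
xml_style = "<p>line one<br />line two</p>"   # self-closed <br />, valid XML

print(is_well_formed(html_style))
print(is_well_formed(xml_style))
```

The point is that minidom expects well-formed XML, so ordinary HTML from the web will often be rejected before any DOM is built.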

I tried this:

import urllib2
from xml.dom.minidom import parseString

page = urllib2.urlopen(page_url)
data = ""
for line in page.readlines():
    data += str(line.replace("<br>", "<br />").replace(OTHER).replace...)
parser = parseString(data)

But this is just not a good solution.

So, is there a library that is less sensitive to mismatched tags and other errors in HTML code?

Zach Kelling · Accepted Answer · 2011-08-24 15:57:13Z

I prefer lxml.html. It is very robust, and lxml in general is quite fast and has very nice capabilities, including XPath support.

import lxml.html

doc = lxml.html.parse('http://example.com')
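
As a rough sketch of how this copes with the kind of markup described in the question, the snippet below parses a deliberately sloppy string instead of a live URL. It assumes lxml is installed; the sample markup and the variable names are invented for illustration.

```python
import lxml.html

# Hypothetical broken markup: a bare <br> and a <p> that is never closed.
broken = (
    "<html><body>"
    "<p>line one<br>line two</p>"
    "<p>unclosed paragraph"
    "</body></html>"
)

# fromstring() uses lxml's lenient HTML parser rather than a strict XML one.
doc = lxml.html.fromstring(broken)

# XPath queries work on the repaired tree.
paragraphs = doc.xpath("//p")
line_breaks = doc.xpath("//br")

for p in paragraphs:
    print(p.text_content())
```

lxml.html.parse takes a filename, URL, or file-like object, while lxml.html.fromstring takes markup you already hold in memory, which can be handy if you fetch the page yourself.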


2 Comments

karantan Over a year ago

Nope, not working. Oh yeah, and lxml doesn't have an .html package (it's only lxml.parse).

Zach Kelling Over a year ago

Maybe you are using an older version? Because it certainly does.
