How do I parse HTML-like data with errors?

Question

I have data that looks like part of an HTML document. However, it contains some errors, such as

<td class= foo"bar">

on which all the parsers I tried (lxml, xml.etree) fail with an error.

Since I don't actually care about this specific part of the document, I am looking for a more robust parser.

Ideally, something that lets me ignore errors in specific subtrees (perhaps by simply not inserting those nodes), or something that only lazily parses the parts of the tree I am actually traversing.

Martijn Pieters · Accepted Answer · 2016-11-06 13:36:13Z

1

You are using XML parsers. XML is a strict language, while the HTML standard requires parsers to be tolerant of errors.

Use a compliant HTML parser like lxml.html or html5lib, or the wrapper library BeautifulSoup (which uses either of these behind a cleaner API). html5lib is slower but closely mimics how a modern browser would treat errors.
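
As a rough illustration of the difference, the sketch below feeds the same malformed markup to a strict XML parser and to lxml.html. The sample string `broken_html` and the variable names are made up for this example, and lxml is assumed to be installed.

```python
import xml.etree.ElementTree as ET

import lxml.html  # third-party: pip install lxml

# Hypothetical sample input containing the malformed attribute from the question.
broken_html = '<table><tr><td class= foo"bar">cell</td></tr></table>'

# A strict XML parser rejects the unquoted attribute value.
try:
    ET.fromstring(broken_html)
    xml_parse_failed = False
except ET.ParseError:
    xml_parse_failed = True

# The HTML parser recovers from the error and still builds a tree.
root = lxml.html.fromstring(broken_html)
cells = [td.text for td in root.iter("td")]

print(xml_parse_failed, cells)
```

How the parser interprets the broken attribute itself may vary between libraries, so it is safest to rely only on the parts of the tree you actually need.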


1 Comment

Sarien Over a year ago

BeautifulSoup is also really convenient for navigating the result!
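
A small sketch of what that navigation can look like. It assumes the beautifulsoup4 package is installed and, to avoid further dependencies, uses the standard library's `html.parser` backend rather than lxml or html5lib; the sample markup is invented for the example.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical sample input with the malformed attribute from the question.
broken_html = '<table><tr><td class= foo"bar">first</td><td>second</td></tr></table>'

# "html.parser" is the stdlib backend; "lxml" or "html5lib" can be passed
# instead if those packages are installed.
soup = BeautifulSoup(broken_html, "html.parser")

# Search the tree by tag name, then move around relative to a node.
first_td = soup.find("td")
row_texts = [td.get_text() for td in soup.find_all("td")]
parent_name = first_td.parent.name

print(row_texts, parent_name)
```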

Laurent LAPORTE · Accepted Answer · 2016-11-06 13:39:32Z

1

Use lxml:

Create an HTML parser with the recover option set to True:

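from io import StringIO
from lxml import etree

parser = etree.HTMLParser(recover=True)  # keep parsing past malformed markup
tree   = etree.parse(StringIO(broken_html), parser)  # broken_html: your input string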

See the tutorial Parsing XML and HTML with lxml.

