I have data that looks like it is part of an HTML document. However, it contains some malformed markup, such as

<td class= foo"bar">

on which all the parsers I tried (lxml, xml.etree) fail with an error.

Since I don't actually care about this specific part of the document, I am looking for a more robust parser.

Ideally, it would let me ignore errors in specific subtrees (perhaps by simply not inserting the affected nodes), or it would parse lazily, touching only the parts of the tree I actually traverse.

2 Answers

You are using XML parsers. XML is a strict language, while the HTML standard requires parsers to be tolerant of errors.

Use an HTML parser such as lxml.html or html5lib, or the wrapper library BeautifulSoup, which can use either of them and offers a cleaner API. html5lib is slower but closely mimics how a modern browser would treat errors.
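To illustrate the difference between strict XML parsing and tolerant HTML parsing, here is a minimal sketch that feeds the broken markup from the question to both kinds of parser. It deliberately uses only the standard library: xml.etree as the strict parser and html.parser as a stand-in for the tolerant ones, so it runs without installing anything. It is not one of the libraries recommended above, and the surrounding table markup, the BROKEN constant, and the helper names (xml_parse_fails, CellCollector, collect_cells) are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: the malformed <td> from the question inside a small table.
BROKEN = '<table><tr><td class= foo"bar">broken</td><td>ok</td></tr></table>'


def xml_parse_fails(markup):
    """Return True if the strict XML parser rejects the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return True
    return False


class CellCollector(HTMLParser):
    """Collect the text of every <td>, ignoring its (possibly malformed) attributes."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_td:
            self.cells[-1] += data


def collect_cells(markup):
    """Feed the markup to the tolerant HTML parser and return the cell texts."""
    collector = CellCollector()
    collector.feed(markup)
    collector.close()
    return collector.cells
```

Here xml_parse_fails(BROKEN) reports the error that the XML parser raises, while collect_cells(BROKEN) still returns the text of both cells. With lxml.html, html5lib, or BeautifulSoup you would get a navigable tree instead of writing callbacks by hand.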

1 Comment

BeautifulSoup is also really convenient for navigating the result!

Use lxml:

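Create an HTML parser with recover set to True (this is already the default for etree.HTMLParser, so passing it just makes the intent explicit):

from io import StringIO
from lxml import etree
parser = etree.HTMLParser(recover=True)
tree = etree.parse(StringIO(broken_html), parser)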

See the tutorial Parsing XML and HTML with lxml.
