Forgive me if this has been asked a billion times: what are the available options for parsing HTML in Python? Specifically, I'm dealing with some legacy sites that have a lot of errors. Are there any parsers that are really fault tolerant?
1 Answer
In my experience, among the many Python XML/HTML libraries, Beautiful Soup is really good at processing broken HTML.
Raw:
<i>This <span title="a">is<br> some <html>invalid</htl %> HTML.
<sarcasm>It's so great!</sarcasm>
Parsed with BeautifulSoup:
<i>This
<span title="a">is
<br /> some
<html>invalid HTML.
<sarcasm>It's so great!
</sarcasm>
</html>
</span>
</i>
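For reference, here is a minimal sketch of how the raw snippet above could be fed to Beautiful Soup. It assumes the current `beautifulsoup4` package (imported as `bs4`) with Python's built-in `html.parser` backend; the variable names are only illustrative. The exact tree you get back depends on the backend and library version, so it may not match the output shown above line for line.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# The broken markup from the "Raw" example: unclosed <i> and <span>,
# a stray <html> tag, and a mangled closing tag.
raw = (
    '<i>This <span title="a">is<br> some <html>invalid</htl %> HTML.\n'
    "<sarcasm>It's so great!</sarcasm>"
)

# "html.parser" is the stdlib backend; "lxml" and "html5lib" are
# alternative backends that repair broken markup differently.
soup = BeautifulSoup(raw, "html.parser")

# Print the repaired tree, one tag per line.
print(soup.prettify())

# Even with the broken markup, elements and attributes stay reachable.
print(soup.find("span")["title"])
print(soup.find("sarcasm").get_text())
```

If one backend gives you a tree you don't like for a particular legacy page, it is worth trying another, since each one makes different guesses about where the missing closing tags belong.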
1 Comment
user2905592
Awesome, this looks like it will do the trick.