Robustly Parsing HTML in Python [duplicate]

Question

Forgive me if this has been asked a billion times. What are the available options for parsing HTML in Python? Specifically, I'm dealing with some legacy sites whose markup has a lot of errors. Are there any parsers that are really fault tolerant?

Check stackoverflow.com/q/717541/2870069, stackoverflow.com/q/6494199/2870069, stackoverflow.com/q/11709079/2870069, stackoverflow.com/q/13759158/2870069 and others.
– Jakob, commented Oct 22, 2013 at 6:43

Leonardo.Z · Accepted Answer · 2013-10-22 05:27:02Z

3

In my experience, among the many Python XML/HTML libraries, Beautiful Soup is really good at processing broken HTML.

Raw:

<i>This <span title="a">is<br> some <html>invalid</htl %> HTML. 
<sarcasm>It's so great!</sarcasm>

Parsed with BeautifulSoup:

 <i>This 
  <span title="a">is
   <br /> some 
   <html>invalid HTML. 
    <sarcasm>It's so great!
    </sarcasm>
   </html>
  </span>
 </i>
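
For anyone who wants to try this, here is a minimal sketch that feeds the same broken snippet to Beautiful Soup. It assumes the `beautifulsoup4` package is installed and uses the standard-library `html.parser` backend. The variable names are mine, and the exact tree you get back may differ from the output shown above depending on the Beautiful Soup version and the backend chosen (`lxml` and `html5lib` are other backends it supports, each with its own recovery rules).

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# The malformed snippet from above: unclosed <i> and <span>, a stray <html>,
# and a mangled closing tag.
RAW = (
    '<i>This <span title="a">is<br> some <html>invalid</htl %> HTML. \n'
    "<sarcasm>It's so great!</sarcasm>"
)

# "html.parser" is the standard-library backend; nothing beyond bs4 is needed.
soup = BeautifulSoup(RAW, "html.parser")

# Pretty-print whatever tree the parser managed to recover.
print(soup.prettify())

# The recovered tree can be queried as usual despite the broken markup.
span_title = soup.find("span")["title"]
sarcasm_text = soup.find("sarcasm").get_text()
all_text = soup.get_text()

print(span_title)
print(sarcasm_text)
```

The point is that no exception is raised: the unknown closing tag is ignored and the unclosed elements are closed for you, so attributes and text remain reachable through `find()` and `get_text()`.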



1 Comment

user2905592 Over a year ago

Awesome, this looks like it will do the trick.
