Forgive me if this has been asked a billion times: what are the available options for parsing HTML in Python? Specifically, I'm dealing with some legacy sites whose markup has a lot of errors. Are there any parsers that are really fault tolerant?

1 Answer

In my experience, among the many Python XML/HTML libraries, Beautiful Soup is really good at processing broken HTML.

Raw:

<i>This <span title="a">is<br> some <html>invalid</htl %> HTML. 
<sarcasm>It's so great!</sarcasm>

Parsed with BeautifulSoup:

 <i>This 
  <span title="a">is
   <br /> some 
   <html>invalid HTML. 
    <sarcasm>It's so great!
    </sarcasm>
   </html>
  </span>
 </i>
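
For anyone who wants to try this, here is a minimal sketch of feeding the raw snippet above to Beautiful Soup. It assumes the third-party `bs4` package (Beautiful Soup 4) is installed and uses the standard library's `html.parser` backend; the helper name `parse_broken_html` is made up for illustration. The exact nesting and whitespace of the pretty-printed tree may differ from the output shown above, depending on the Beautiful Soup version and the parser backend chosen.

```python
# Assumption: the third-party bs4 package is installed (pip install beautifulsoup4).
from bs4 import BeautifulSoup

# The raw, invalid markup from the example above.
BROKEN_HTML = (
    '<i>This <span title="a">is<br> some <html>invalid</htl %> HTML. \n'
    "<sarcasm>It's so great!</sarcasm>"
)


def parse_broken_html(markup):
    """Hypothetical helper: parse possibly malformed markup into a tree.

    "html.parser" selects the tree builder backed by Python's standard
    library, so no extra parser package is required.
    """
    return BeautifulSoup(markup, "html.parser")


soup = parse_broken_html(BROKEN_HTML)

# Pretty-print the repaired tree; unclosed tags are closed automatically.
print(soup.prettify())

# The tree can be queried even though the input was invalid.
print(soup.find("span")["title"])
print(soup.find("sarcasm").get_text())
```

If the result is not what you expect on a particular page, it may be worth trying another backend such as `lxml` or `html5lib` (each a separate install), since the backends repair broken markup in different ways.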

1 Comment

Awesome, this looks like it will do the trick.