How to parse malformed HTML in Python

Question

I need to browse the DOM tree of a parsed HTML document.

I'm running the HTML through uTidyLib before parsing the result with lxml:

a = tidy.parseString(html_code, options)
dom = etree.fromstring(str(a))

Sometimes I get an error; it seems that tidylib cannot always repair malformed HTML.

How can I parse every HTML file without getting an error, even if that means parsing only parts of the files that cannot be repaired?

dbr · Accepted Answer · 2009-05-25 02:02:00Z

27

Beautiful Soup does a good job with invalid/broken HTML:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<htm@)($*><body><table <tr><td>hi</tr></td></body><html")
>>> print soup.prettify()
<htm>
 <body>
  <table>
   <tr>
    <td>
     hi
    </td>
   </tr>
  </table>
 </body>
</htm>
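
The session above uses the old BeautifulSoup 3 import and the Python 2 print statement. On Python 3, a rough equivalent with Beautiful Soup 4 might look like the sketch below. It assumes the beautifulsoup4 package is installed and picks the standard-library "html.parser" backend; the helper name first_cell_text is made up for illustration. The exact tree recovered from markup this broken can differ between backends, so inspect the result rather than relying on one particular shape.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Same broken markup as in the answer above.
BROKEN = "<htm@)($*><body><table <tr><td>hi</tr></td></body><html"


def first_cell_text(html_code):
    """Return the stripped text of the first <td>, or None if there is none.

    Hypothetical helper for illustration only.
    """
    # "html.parser" is the pure-Python backend shipped with the standard library.
    soup = BeautifulSoup(html_code, "html.parser")
    cell = soup.find("td")
    return cell.get_text(strip=True) if cell is not None else None


if __name__ == "__main__":
    print(first_cell_text(BROKEN))
    # prettify() shows how the parser decided to repair the document.
    print(BeautifulSoup(BROKEN, "html.parser").prettify())
```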

edited May 25, 2009 at 2:02

answered May 24, 2009 at 21:06

dbr

Van Gale (edited by tripleee) · Accepted Answer · 2013-10-03 06:35:17Z

13

Since you are already using lxml, have you tried lxml's ElementSoup module (available as lxml.html.soupparser)?

If ElementSoup can't repair the HTML, then you'll probably need to apply your own filters first, based on your own observations of how the data is broken.
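
As a minimal sketch of what such a pre-filter could look like, the snippet below applies two repair rules before handing the text to a parser. Both rules are assumptions chosen for illustration (stray control characters, and a tag left unclosed before the next tag opens, as in "<table <tr>"); your own data will need rules based on what you actually observe. The names prefilter, TagCollector and start_tags are hypothetical, and the standard-library html.parser stands in for whichever parser you really use.

```python
import re
from html.parser import HTMLParser

# Hypothetical repair rules; each targets a breakage assumed to occur in the input.
# Rule 1: control characters that should never appear in HTML text.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# Rule 2: a tag that was never closed before the next tag opens, e.g. "<table <tr>".
UNCLOSED_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\s+(?=<)")


def prefilter(html_code):
    """Apply the repair rules in order and return the cleaned markup."""
    cleaned = CONTROL_CHARS.sub("", html_code)
    cleaned = UNCLOSED_TAG.sub(r"<\1>", cleaned)
    return cleaned


class TagCollector(HTMLParser):
    """Records the name of every start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def start_tags(html_code):
    """Pre-filter the markup, parse it, and list the start tags found."""
    collector = TagCollector()
    collector.feed(prefilter(html_code))
    collector.close()
    return collector.start_tags


if __name__ == "__main__":
    print(start_tags("<table <tr><td>hi\x00</td>"))
```

Keeping each rule as a separate named pattern makes it easy to add, remove, or reorder rules as you discover new kinds of breakage.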

edited Oct 3, 2013 at 6:35

tripleee

answered May 24, 2009 at 22:52

Van Gale

2 Comments

tripleee Over a year ago

Links were broken; edited them. Hopefully the new locations contain the same content that you were originally pointing to.

BobTuckerman Over a year ago

If you don't have Beautiful Soup installed, you may need it for ElementSoup. Just run pip install beautifulsoup (current releases are published as beautifulsoup4).
