How to parse malformed HTML in python, using standard libraries

Question

There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.

I've found plenty of great third-party libraries for this task, but this question is about the Python standard library.

Requirements:

Use only Python standard library components (any 2.x version)
DOM support
Handle HTML entities (&nbsp;)
Handle partial documents (like: Hello, <i>World</i>!)

Bonus points:

XPATH support
Handle unclosed/malformed tags. (<big>does anyone here know <html ???)

Here's my 90% solution, as requested. This works for the limited set of HTML I've tried, but as everyone can plainly see, this isn't exactly robust. Since it took only 15 minutes of staring at the docs and one line of code, I thought I would be able to consult the Stack Overflow community for a similar but better solution...

from xml.etree.ElementTree import fromstring
# html: the fragment as a string; &nbsp; is not defined in XML, so use &#160;
DOM = fromstring("<html>%s</html>" % html.replace('&nbsp;', '&#160;'))
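
A hedged extension of the one-liner above: instead of replacing only &nbsp;, every named HTML entity can be rewritten as a numeric character reference before the string reaches the XML parser. This is a minimal sketch and does nothing for malformed markup. The names parse_fragment and _to_numeric are hypothetical, and the import fallback assumes either Python 2 (htmlentitydefs) or Python 3 (html.entities).

```python
import re
from xml.etree.ElementTree import fromstring

try:
    from html.entities import name2codepoint   # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

# Entities that XML already defines; leave these for the XML parser.
XML_BUILTIN = ("amp", "lt", "gt", "quot", "apos")


def _to_numeric(match):
    """Turn a known named HTML entity into a numeric character reference."""
    name = match.group(1)
    if name in XML_BUILTIN or name not in name2codepoint:
        return match.group(0)  # unknown names are left untouched
    return "&#%d;" % name2codepoint[name]


def parse_fragment(fragment):
    """Wrap a partial document in one root element and parse it as XML."""
    cleaned = re.sub(r"&([A-Za-z][A-Za-z0-9]*);", _to_numeric, fragment)
    return fromstring("<html>%s</html>" % cleaned)


dom = parse_fragment("Hello,&nbsp;<i>World</i>!")
italic = dom.findall(".//i")  # ElementTree's limited XPath subset
```

The result is an ordinary Element, so the DOM-style and limited XPath requirements are covered for well-formed input; anything that is not well-formed XML still raises an error.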

I don't get it. Are you expecting us to do what? You know that there is no such module in stdlib. What is your question?
– SilentGhost, Commented Apr 20, 2010 at 17:04
@bukzor: I think you're misunderstanding the idea behind the stdlib.
– SilentGhost, Commented Apr 20, 2010 at 17:06
@bukzor: If you can get 90% of the way there with std. libs, point out some explicit examples of what you are unable to do. If you work somewhere where you can easily pass along Python scripts, your audience shouldn't fret too much at the 15 seconds it takes to install a nice packaged library, especially if you have it downloaded to your intranet and provide a handy-dandy link in the email. If you're being a sysadmin, maybe repackage a bunch of useful ones and push them out?
– Nick T, Commented Apr 20, 2010 at 17:28
@SilentGhost: A common python motto is 'batteries included', meaning that you should be able to do most tasks using the stdlib. Maybe HTML DOM is not one of those things. That's what this question is trying to clarify.
– bukzor, Commented Apr 20, 2010 at 17:34
@bukzor: As mikerobi pointed out, the BeautifulSoup source is really small, so if you really want a single-file script with no 3P dependencies, copy-paste sounds like your best bet, and just skip trying to stitch together some stdlibs.
– Nick T, Commented Apr 20, 2010 at 17:35

Ian Bicking · Accepted Answer · 2010-04-21 06:18:17Z

48

Parsing HTML reliably is a relatively modern development (weird though that may seem). As a result there is definitely nothing in the standard library. HTMLParser may appear to be a way to handle HTML, but it's not -- it fails on lots of very common HTML, and though you can work around those failures there will always be another case you haven't thought of (if you actually succeed at handling every failure you'll have basically recreated BeautifulSoup).
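
To make the "work around those failures" route concrete, here is a rough sketch of what such a workaround tends to look like: an HTMLParser subclass that feeds its events into ElementTree's TreeBuilder and closes whatever is still open at the end. It is an illustration under stated assumptions, not a recommended parser. TreeHTMLParser, finish and parse_html are hypothetical names, the VOID tuple is deliberately incomplete, and the code assumes Python 3.5+ (html.parser with convert_charrefs); on 2.x the module is HTMLParser and the entity callbacks would need handling too. It does not implement implied end tags, so it still gets plenty of real-world HTML wrong.

```python
from html.parser import HTMLParser
from xml.etree.ElementTree import TreeBuilder

# Elements that never take a closing tag (incomplete list, for illustration).
VOID = ("br", "hr", "img", "input", "meta", "link")


class TreeHTMLParser(HTMLParser):
    """Builds an ElementTree from HTMLParser events, tolerating unclosed tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive as plain text
        self.builder = TreeBuilder()
        self.open_tags = []
        self.builder.start("html", {})  # synthetic root for partial documents

    def handle_starttag(self, tag, attrs):
        # Valueless attributes come through as None; store them as "".
        self.builder.start(tag, dict((k, v or "") for k, v in attrs))
        if tag in VOID:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        """Flush buffered text, close open elements, return the root Element."""
        self.close()
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("html")
        return self.builder.close()


def parse_html(text):
    parser = TreeHTMLParser()
    parser.feed(text)
    return parser.finish()


root = parse_html("Hello, <i>World</i>!")
```

Even this much code only covers unclosed and stray tags; every further fix (implied </p>, misnested inline tags, script content) moves it closer to a reimplementation of an existing library, which is the point the answer makes.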

There are really only 3 reasonable ways to parse HTML (as it is found on the web): lxml.html, BeautifulSoup, and html5lib. lxml is the fastest by far, but can be a bit tricky to install (and impossible in an environment like App Engine). html5lib is based on how HTML 5 specifies parsing; though similar in practice to the other two, it is perhaps more "correct" in how it parses broken HTML (they all parse pretty-good HTML the same). They all do a respectable job at parsing broken HTML. BeautifulSoup can be convenient though I find its API unnecessarily quirky.


4 Comments

bukzor Over a year ago

Great answer. Thanks! I don't have enough rep to uprate you. QQ I wish people weren't so touchy about hard questions. The good scientist seeks negative experiments as well..

bukzor Over a year ago

@Ian Bicking: finally got enough rep to bump you. Just to confirm, there's no known way to get ElementTree (as it exists in the stdlib) to parse real-world HTML?

Ian Bicking Over a year ago

You can have BeautifulSoup (with ElementSoup) or html5lib parse the HTML and generate an ElementTree structure, but ElementTree itself definitely cannot parse HTML.
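
A small sketch of what "cannot parse HTML" means in practice: ElementTree's fromstring is a strict XML parser, so the well-formed fragment from the question parses while the malformed one raises an error. The helper name try_parse is hypothetical, and the code assumes Python 2.7 or 3.x, where xml.etree.ElementTree exposes ParseError.

```python
from xml.etree.ElementTree import fromstring, ParseError


def try_parse(fragment):
    """Return (element, None) on success or (None, error message) on failure."""
    try:
        return fromstring("<html>%s</html>" % fragment), None
    except ParseError as exc:
        return None, str(exc)


# Well-formed partial document: parses once it is wrapped in a root element.
ok, err_ok = try_parse("Hello, <i>World</i>!")

# Malformed example from the question: rejected as not well-formed XML.
bad, err_bad = try_parse("<big>does anyone here know <html ???")
```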

bukzor Over a year ago

With some finagling and a little bit of HTML-correction, I've gotten ElementTree to parse all of RosettaCode.org. The most annoying part is adding all the html entities to the parser by hand. There's even an option for this in the etree docs, but it's unimplemented for undocumented reasons. You can see the code here: bukzor.hopto.org/svn/software/python/rosetta_pylint.py

David Z · Answer · 2010-04-20 16:42:21Z

5

Take the source code of BeautifulSoup and copy it into your script ;-) I'm only sort of kidding... anything you could write that would do the job would more or less be duplicating the functionality that already exists in libraries like that.

If that's really not going to work, I have to ask, why is it so important that you only use standard library components?


12 Comments

bukzor Over a year ago

It's not so important. It's simply my question. As I said, there is a ton of HTML and XML support in the Python library. It seems like something there should support this. If not, that's an answer too, but I'm not convinced yet.

Mike Graham Over a year ago

Note that BeautifulSoup is no longer being maintained. I prefer lxml.html myself. Overall, this is a great answer.

David Z Over a year ago

Where did you hear that? The BeautifulSoup website shows no evidence that it is no longer being maintained. In fact the most recent release was 11 days ago. (Of course, any other third-party HTML parser works just as well for the argument I was making in the answer)

Nick T Over a year ago

Maybe he was thinking BS 3.0 was only for Python 3.x? Their site indicates BS 3.0 is for Py 2.3-2.6, and BS 3.1 is for Py 3.x (though ironically the last BS 3.1 release is about a year old, versus a couple weeks for BS 3.0)

Mike Graham Over a year ago

@bukzor, ElementSoup is an implementation of ElementTree using BeautifulSoup for parsing. ElementTree is an API with many implementations for parsing XML and HTML.


mikerobi · Answer · 2010-04-20 17:17:07Z

4

Your choices are to change your requirements or to duplicate all of the work done by the developers of third party modules.

BeautifulSoup consists of a single Python file with about 2000 lines of code. If that is too big a dependency, then go ahead and write your own; it won't work as well and probably won't be a whole lot smaller.


2 Comments

Nick T Over a year ago

If it's really that compact (never really bothered to look :P ) and he's hell-bent on having a script work without any other dependencies, copy-paste sounds like a great plan.

Mike Graham Over a year ago

Literal copy-and-paste is a ridiculous way to add a dependency.

PW. · Answer · 2010-04-20 16:36:43Z

1

It doesn't fit your stdlib-only requirement, but BeautifulSoup is nice.


1 Comment

bukzor Over a year ago

That's one of the libraries that I referenced with this: "I've found plenty of great third-party libraries for this task, but this question is about the python standard library."

Mike Graham · Answer · 2010-04-20 17:06:06Z

1

I cannot think of any popular language with a good, robust, heuristic HTML parsing library in its stdlib. Python certainly does not have one, which is something I think you know.

Why the requirement of a stdlib module? Most of the time when I hear people make that requirement, they are being silly. For most major tasks, you will need a third party module or to spend a whole lot of work re-implementing one. Introducing a dependency is a good thing, since that's work you didn't have to do.

So what you want is lxml.html. Ship lxml with your code if that's an issue, at which point it becomes functionally equivalent to writing it yourself except in difficulty, bugginess, and maintainability.


4 Comments

bukzor Over a year ago

From my research, I was seeing that as the most common answer, but I don't know, and I'm still not convinced that there's no such capability in the stdlib. You'll have to admit that a script that uses no external library is much more likely to work correctly for novice users.

Mike Graham Over a year ago

@bukzor, Well get convinced, since it's the case. =p And I do not have to admit that at all. ;)

Ian Bicking Over a year ago

Parsing HTML is something people have only actually understood widely for a few years now; it's taken shockingly long. So it can be said quite definitively that there is nothing in the standard library: BeautifulSoup, html5lib, and lxml.html make a complete list.

bukzor Over a year ago

@Ian Bicking: If you'd make that an answer, I'd check it. Am I getting downrated simply because my answer is no?

Michael · Answer · 2013-02-16 11:22:40Z

0

As already stated, there is currently no satisfactory solution using only the standard library. I faced the same problem when I tried to run one of my programs in an outdated hosting environment that offered only Python 2.6 and no way to install my own extensions. Solution:

Grab this file (ElementSoup.py, per the import below) and the latest stable BeautifulSoup release of the 3.x series (3.2.1 as of now). From that tar file, pick only BeautifulSoup.py; it's the only file you really need to ship with your code. With these two files on your path, all you need to do to get a regular etree object from an HTML string, as you would get from lxml, is this:

from StringIO import StringIO
import ElementSoup

tree = ElementSoup.parse(StringIO(input_str))

lxml itself and html5lib both require you to compile some C code to make them run. It takes considerably more effort to get them working, so avoid them if your environment is restricted or your intended audience isn't willing to do that.


1 Comment

gsnedders Over a year ago

html5lib has no extensions (e.g., C code) that it depends upon. It can optionally use several (such as datrie) to improve performance, but it will work fine without.
