42

There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing.

I've found plenty of great third-party libraries for this task, but this question is about the Python standard library.

Requirements:

  • Use only Python standard library components (any 2.x version)
  • DOM support
  • Handle HTML entities (&nbsp;)
  • Handle partial documents (like: Hello, <i>World</i>!)

Bonus points:

  • XPATH support
  • Handle unclosed/malformed tags. (<big>does anyone here know <html ???)

Here's my 90% solution, as requested. This works for the limited set of HTML I've tried, but as everyone can plainly see, this isn't exactly robust. Since I did this by staring at the docs for 15 minutes and one line of code, I thought I would be able to consult the stackoverflow community for a similar but better solution...

from xml.etree.ElementTree import fromstring
DOM = fromstring("<html>%s</html>" % html.replace('&nbsp;', '&#160;'))
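A slightly more general version of the same idea can be sketched as follows. Instead of special-casing &nbsp;, it rewrites every named HTML entity known to the standard library's entity table as a numeric character reference before handing the wrapped fragment to ElementTree. The names parse_fragment and _to_numeric are hypothetical helpers made up for this illustration, and the input still has to be well-formed once wrapped, so this does nothing for unclosed or malformed tags.

```python
import re
from xml.etree.ElementTree import fromstring

try:
    from html.entities import name2codepoint  # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

# Matches named entity references such as &nbsp; or &eacute;
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def _to_numeric(match):
    """Replace a known named entity with its numeric form (hypothetical helper)."""
    name = match.group(1)
    if name in name2codepoint:
        return "&#%d;" % name2codepoint[name]
    # Unknown names are left alone; the XML parser will reject them.
    return match.group(0)


def parse_fragment(html):
    """Wrap a partial document in a root element and parse it as XML."""
    xml = _ENTITY.sub(_to_numeric, html)
    return fromstring("<html>%s</html>" % xml)


dom = parse_fragment("Hello,&nbsp;<i>World</i>!")
```

This only removes the need to add entities by hand; any markup that is not valid XML still raises a parse error.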
12
  • 2
    I don't get it. Are you expecting us to do what? You know that there is no such module in stdlib. What is your question? Commented Apr 20, 2010 at 17:04
  • 1
    @bukzor: I think you're misunderstanding the idea behind the stdlib. Commented Apr 20, 2010 at 17:06
  • 2
    @bukzor: If you can get 90% of the way there with std. libs, point out some explicit examples of what you are unable to do. If you work somewhere where you can easily pass along Python scripts, your audience shouldn't fret too much at the 15 seconds it takes to install a nice packaged library, especially if you have it downloaded to your intranet and provide a handy-dandy link in the email. If you're being a sysadmin, maybe repackage a bunch of useful ones and push them out? Commented Apr 20, 2010 at 17:28
  • 2
    @SilentGhost: A common python motto is 'batteries included', meaning that you should be able to do most tasks using the stdlib. Maybe HTML DOM is not one of those things. That's what this question is trying to clarify. Commented Apr 20, 2010 at 17:34
  • 3
    @bukzor: As mikerobi pointed out, the BeautifulSoup source is really small, so if you really want a single-file script with no third-party dependencies, copy-paste sounds like your best bet, and just skip trying to stitch together some stdlibs. Commented Apr 20, 2010 at 17:35

6 Answers

48

Parsing HTML reliably is a relatively modern development (weird though that may seem). As a result there is definitely nothing in the standard library. HTMLParser may appear to be a way to handle HTML, but it's not -- it fails on lots of very common HTML, and though you can work around those failures there will always be another case you haven't thought of (if you actually succeed at handling every failure you'll have basically recreated BeautifulSoup).
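To make the "work around those failures" point concrete, here is a rough sketch of the kind of workaround people attempt: feeding HTMLParser events into ElementTree's TreeBuilder while guessing how to recover from unclosed and stray tags. The class name LenientTreeParser, the parse_lenient function, the "fragment" wrapper element, and the short VOID_TAGS list are all assumptions made up for this illustration. It is written against Python 3, where character references are decoded automatically; in 2.x they arrive through handle_entityref and would need extra handling. It handles only the simplest kinds of breakage.

```python
from xml.etree.ElementTree import TreeBuilder

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

# Elements that never take a closing tag (a partial, illustrative list).
VOID_TAGS = frozenset(["br", "hr", "img", "input", "link", "meta"])


class LenientTreeParser(HTMLParser):
    """Build an ElementTree from tag soup with very naive error recovery."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.builder = TreeBuilder()
        self.open_tags = []
        # Wrap everything in one root so partial documents are accepted.
        self.builder.start("fragment", {})

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        self.builder.start(tag, dict((k, v or "") for k, v in attrs))
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close any unclosed children, then the matching element itself.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        # Close whatever is still open at end of input.
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("fragment")
        return self.builder.close()


def parse_lenient(html):
    parser = LenientTreeParser()
    parser.feed(html)
    parser.close()
    return parser.finish()


root = parse_lenient("Hello,&nbsp;<i>World</i>! <big>does anyone <b>know</i>")
```

Every recovery rule here is a guess (for example, an unclosed <p> or <li> is nested rather than implicitly closed), which is exactly the sort of case-by-case patching that ends up recreating a real HTML parser.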

There are really only 3 reasonable ways to parse HTML (as it is found on the web): lxml.html, BeautifulSoup, and html5lib. lxml is the fastest by far, but can be a bit tricky to install (and impossible in an environment like App Engine). html5lib is based on how HTML 5 specifies parsing; though similar in practice to the other two, it is perhaps more "correct" in how it parses broken HTML (they all parse pretty-good HTML the same). They all do a respectable job at parsing broken HTML. BeautifulSoup can be convenient though I find its API unnecessarily quirky.


4 Comments

Great answer. Thanks! I don't have enough rep to uprate you. QQ I wish people weren't so touchy about hard questions. The good scientist seeks negative experiments as well..
@Ian Bicking: finally got enough rep to bump you. Just to confirm, there's no known way to get ElementTree (as it exists in the stdlib) to parse real-world HTML?
You can have BeautifulSoup (with ElementSoup) or html5lib parse the HTML and generate an ElementTree structure, but ElementTree itself definitely cannot parse HTML.
With some finagling and a little bit of HTML-correction, I've gotten ElementTree to parse all of RosettaCode.org. The most annoying part is adding all the html entities to the parser by hand. There's even an option for this in the etree docs, but it's unimplemented for undocumented reasons. You can see the code here: bukzor.hopto.org/svn/software/python/rosetta_pylint.py
5

Take the source code of BeautifulSoup and copy it into your script ;-) I'm only sort of kidding... anything you could write that would do the job would more or less be duplicating the functionality that already exists in libraries like that.

If that's really not going to work, I have to ask, why is it so important that you only use standard library components?

12 Comments

It's not so important. It's simply my question. As I said, there are tons of html and xml support in the python library. It seems like something there should support this. If not, that's an answer too, but I'm not convinced yet.
Note that BeautifulSoup is no longer being maintained. I prefer lxml.html myself. Overall, this is a great answer.
Where did you hear that? The BeautifulSoup website shows no evidence that it is no longer being maintained. In fact the most recent release was 11 days ago. (Of course, any other third-party HTML parser works just as well for the argument I was making in the answer)
Maybe he was thinking BS 3.0 was only for Python 3.x? Their site indicates BS 3.0 is for Py 2.3-2.6, and BS 3.1 is for Py 3.x (though ironically the last BS 3.1 release is about a year old, versus a couple weeks for BS 3.0)
@bukzor, ElementSoup is an implementation of ElementTree using BeautifulSoup for parsing. ElementTree is an API with many implementations for parsing XML and HTML.
4

Your choices are to change your requirements or to duplicate all of the work done by the developers of third party modules.

BeautifulSoup consists of a single Python file with about 2000 lines of code. If that is too big of a dependency, then go ahead and write your own; it won't work as well and probably won't be a whole lot smaller.

2 Comments

If it's really that compact (never really bothered to look :P ) and he's hell-bent on having a script work without any other dependencies, copy-paste sounds like a great plan.
Literal copy-and-paste is a ridiculous way to add a dependency.
1

This doesn't fit your stdlib-only requirement, but BeautifulSoup is nice.

1 Comment

That's one of the libraries that I referenced with this: "I've found plenty of great third-party libraries for this task, but this question is about the python standard library."
1

I cannot think of any popular languages with a good, robust, heuristic HTML parsing library in its stdlib. Python certainly does not have one, which is something I think you know.

Why the requirement of a stdlib module? Most of the time when I hear people make that requirement, they are being silly. For most major tasks, you will need a third party module or to spend a whole lot of work re-implementing one. Introducing a dependency is a good thing, since that's work you didn't have to do.

So what you want is lxml.html. Ship lxml with your code if that's an issue, at which point it becomes functionally equivalent to writing it yourself except in difficulty, bugginess, and maintainability.

4 Comments

From my research, I was seeing that as the most common answer, but I don't know, and I'm still not convinced that there's no such capability in the stdlib. You'll have to admit that a script that uses no external library is much more likely to work correctly for novice users.
@bukzor, Well get convinced, since it's the case. =p And I do not have to admit that at all. ;)
Parsing HTML is something people have only actually understood widely for a few years now; it's taken shockingly long. So it can be said quite definitively that there is nothing in the standard library: BeautifulSoup, html5lib, and lxml.html make a complete list.
@Ian Bicking: If you'd make that an answer, I'd check it. Am I getting downrated simply because my answer is no?
0

As already stated, there is currently no satisfactory solution using only the standard library. I faced the same problem when I tried to run one of my programs in an outdated hosting environment that had only Python 2.6 and no way to install my own extensions. Solution:

Grab this file and the latest stable BeautifulSoup version of the 3.x series (3.2.1 as of now). From the tar file, pick only BeautifulSoup.py; it's the only file you really need to ship with your code. With these two files on your path, all you need to do to get an ordinary etree object from an HTML string, as you would get from lxml, is this:

from StringIO import StringIO
import ElementSoup

tree = ElementSoup.parse(StringIO(input_str))

lxml itself and html5lib both require you to compile some C code in order to run. It takes considerably more effort to get them working, so if your environment is restricted, or your intended audience is not willing to do that, avoid them.

1 Comment

html5lib has no extensions (e.g., C code) that it depends upon. It can optionally use several (such as datrie) to improve performance, but it will work fine without.
