27

What's my best bet for parsing HTML if I can't use BeautifulSoup or lxml? I've got some code that uses sgmllib, but it's a bit low-level and it's now deprecated.

I would prefer if it could stomach a bit of malformed HTML, although I'm pretty sure most of the input will be pretty clean.

3
  • 1
    If I may ask, why can't you use lxml, or BS? Commented Apr 4, 2009 at 18:26
  • 1
    I was trying to avoid answers getting completely sidetracked. My reasons for avoiding BeautifulSoup are hugely debatable but I was saving that for another day! (My reasons for avoiding lxml are simple - a complete failure to install it on either Mac OSX or Linux :( Commented Apr 5, 2009 at 9:27
  • 2
    Here is how to install lxml on Linux: sudo apt-get install libxml2-dev libxslt-dev python2.7-dev (python2.6-dev if you use Python 2.6). Then sudo pip install lxml. Commented Aug 12, 2011 at 20:32

6 Answers

10

Python has a native HTML parser; however, the Tidy wrapper Nick suggested would probably be a solid choice as well. Tidy is a very common library (written in C, is it?).
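
For illustration, here is a minimal sketch of what using the native parser can look like. It assumes Python 3, where the module is `html.parser` (on Python 2 the equivalent class lives in the `HTMLParser` module). The names `LinkCollector` and `extract_links` are made up for this example. The parser is event-based: it does not build a tree or repair markup, it just reports the tags and text it encounters.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical handler: collects href values from <a> tags and visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, where value may be None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep only non-empty text runs.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def extract_links(html):
    """Feed a whole document to the parser and return (links, text_parts)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return parser.links, parser.text_parts


# Deliberately sloppy input: unclosed <p> and <b>, an unquoted attribute
# value, and upper-case tag and attribute names.
sample = '<p>See <a href=page1.html>first<p>and <A HREF="/b">second</a><b>bold'
links, text = extract_links(sample)
print(links)
print(text)
```

Because the handler only reacts to events, unclosed tags like the ones in `sample` are simply never matched with an end tag; anything that needs a corrected document tree would still call for Tidy or a similar library.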


3 Comments

Can someone please tell me why people suggest BeautifulSoup or lxml over the native HTML parser?
Link is broken… I guess this is html.parser? Or the version for legacy Python.
The module is still there; the URL appears to have changed, though. Fixed.
2

Perhaps µTidylib will meet your needs?


2

You can install lxml and many other Python modules easily and seamlessly on the Mac (OS X) using Pallet, which is the official MacPorts GUI.

The module name is py27-lxml. Easy as 1,2,3.


1

http://www.xmlhack.com/read.php?item=1392 http://sourceforge.net/projects/pirxx/

http://pyxml.sourceforge.net/topics/

I don't have much experience with Python, but I have used Xerces (from the Apache foundation) in the past and found it to be very useful. The learning curve isn't bad either, though I'm not coming from a Python perspective. I suggest you consider it, though. (The first two links I've included discuss Python interfaces to Xerces, and the last one is the first Google hit on "python xml".)

1 Comment

I know you want an HTML parser, but these will be good starting places.
1

htql is good at handling malformed HTML:

http://htql.net/


1

html5lib is good:
http://code.google.com/p/html5lib/

Update: The link above is broken. A third-party mirror of it can be accessed at https://github.com/html5lib/gcode-import

2 Comments

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
This isn't quite a link-only answer, @Dgw. It contains a complete sentence mentioning the name of the linked-to library, and in the case of this question, the name of a library is the essential part of the answer. Anyone can search for it in case the link is dead.
