Parsing HTML in Python [closed]

Question

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.

Closed 12 years ago.

What's my best bet for parsing HTML if I can't use BeautifulSoup or lxml? I've got some code that uses sgmllib, but it's a bit low-level and it's now deprecated.

I would prefer something that can stomach a bit of malformed HTML, although I'm pretty sure most of the input will be fairly clean.

I was trying to avoid answers getting completely sidetracked. My reasons for avoiding BeautifulSoup are hugely debatable, but I was saving that for another day! (My reasons for avoiding lxml are simple: a complete failure to install it on either Mac OS X or Linux.) :(
– Andy Baker, Commented Apr 5, 2009 at 9:27
Here is how to install lxml on Linux: sudo apt-get install libxml2-dev libxslt-dev python2.7-dev (python2.6-dev if you use Python 2.6). Then sudo pip install lxml.
– Jabba, Commented Aug 12, 2011 at 20:32

Andrei Taranchenko · Accepted Answer · 2019-12-17 21:48:06Z

10

Python has a native HTML parser; however, the Tidy wrapper Nick suggested would probably be a solid choice as well. Tidy is a very common library (written in C, isn't it?).
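For illustration, here is a minimal sketch of the native parser, assuming Python 3, where the module is html.parser (in Python 2 it was called HTMLParser). The class name LinkCollector, the helper extract_links and the sample markup are made up for this example and are not part of any library. The parser does not build a tree; it only calls your handler methods as it meets tags and text, so it tends to cope with sloppy markup such as unclosed tags or unquoted attributes.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags plus visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep only non-empty runs of text between tags.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def extract_links(markup):
    """Feed a string of HTML to the parser and return (links, text_parts)."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return parser.links, parser.text_parts


# Deliberately sloppy input: unclosed <p> and <b>, an unquoted attribute, an uppercase tag.
sample = '<p>See <A HREF=/docs>the docs<p>or <a href="http://example.com/">this <b>site</a>'
links, text = extract_links(sample)
print(links)
print(text)
```

Because nothing checks that tags are balanced, any structure you need (nesting, "am I inside a table?") has to be tracked by hand in the handlers, which is the main reason people reach for a tree-building library instead.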

edited Dec 17, 2019 at 21:48

answered Apr 4, 2009 at 20:00

Andrei Taranchenko

3 Comments

Shatu Over a year ago

Can someone please tell me why people suggest BeautifulSoup or lxml over the native HTML parser?

Brutus Over a year ago

Link is broken… I guess this is html.parser? Or the version for legacy Python.

Andrei Taranchenko Over a year ago

The module is still there; the URL appears to have changed, though. Fixed.

Nick Presta · Accepted Answer · 2009-04-04 18:14:20Z

2

Perhaps µTidylib will meet your needs?

answered Apr 4, 2009 at 18:14

Nick Presta

Gussisaurio · Accepted Answer · 2012-06-27 17:37:16Z

2

You can install lxml and many other Python modules easily and seamlessly on the Mac (OS X) using Pallet, the official MacPorts GUI.

The module name is py27-lxml. Easy as 1, 2, 3.

answered Jun 27, 2012 at 17:37

Gussisaurio

Joe Bane · Accepted Answer · 2009-04-04 18:29:55Z

1

http://www.xmlhack.com/read.php?item=1392 http://sourceforge.net/projects/pirxx/

http://pyxml.sourceforge.net/topics/

I don't have much experience with Python, but I have used Xerces (from the Apache Foundation) in the past and found it very useful. The learning curve isn't bad either, though I'm not coming from a Python perspective. I suggest you consider it. (The first two links I've included discuss Python interfaces to Xerces, and the last one is the first Google hit for "python xml".)

answered Apr 4, 2009 at 18:29

Joe Bane

1 Comment

Joe Bane Over a year ago

I know you want an HTML parser, but these will be good starting places.

seagulf · Accepted Answer · 2011-03-23 14:25:04Z

1

htql is good at handling malformed HTML:

http://htql.net/

answered Mar 23, 2011 at 14:25

seagulf

Shatu · Accepted Answer · 2017-08-25 22:43:39Z

1

html5lib is good:
http://code.google.com/p/html5lib/

Update: The link above is broken. A third-party mirror of it can be accessed at https://github.com/html5lib/gcode-import

edited Aug 25, 2017 at 22:43

Shatu

answered Jun 4, 2010 at 11:51

rudyryk

2 Comments

dgw Over a year ago

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.

Rob Kennedy Over a year ago

This isn't quite a link-only answer, @Dgw. It contains a complete sentence mentioning the name of the linked-to library, and in the case of this question, the name of a library is the essential part of the answer. Anyone can search for it in case the link is dead.
