HTML Parser In Python without fixing HTML

Question

I need to parse HTML, but I do not want the Python parsing library to attempt to "fix" it. Any suggestions for a tool or method to use (in Python)? In my situation, if the HTML is malformed, my script needs to stop processing. I tried BeautifulSoup, but it fixed things that I did not want fixed. I'm creating a tool that parses template files and outputs them converted to another template style.

Are you looking for some already created code or do you want to code your own parser?
– Victor, Commented Oct 31, 2011 at 2:50
I guess a good starting place would be: what are you trying to fix?
– user849425, Commented Oct 31, 2011 at 2:50

Brandon Rhodes · Accepted Answer · 2023-08-25 11:06:18Z

4

The book Foundations of Python Network Programming has a detailed comparison of what it looks like to scrape the same web page with Beautiful Soup and with the lxml library. In general, you will find that lxml is faster, more effective, and has an API that adheres closely to a Python standard (the ElementTree API, which comes with the Python Standard Library). See this blog post by the inimitable Ian Bicking for an idea of why you should be looking at lxml instead of the old-fashioned Beautiful Soup library for parsing HTML:

https://ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/

edited Aug 25, 2023 at 11:06

answered Oct 31, 2011 at 3:19

Brandon Rhodes


2 Comments

Pavel Shvedov Over a year ago

lxml's standard XML parser will raise an exception on malformed HTML, and its HTML parser will fix the errors anyway.

Brandon Rhodes Over a year ago

Yes, good point — always be sure to use lxml's forgiving HTML parser, never its standard XML parser, when trying to scrape a web page! :)
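For the original question (stop on malformed input rather than repair it), the strict-parser behaviour mentioned in the comments above can be sketched with the standard library's ElementTree module instead of lxml. This is a minimal sketch, assuming the templates are well-formed in the XML sense (every tag closed, void tags written like `<br/>`); ordinary HTML that relies on optional end tags would be rejected. The helper name `parse_strict` is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_strict(markup):
    """Return the root element, or raise ValueError if markup is not well-formed.

    Hypothetical helper: the strict XML parser never repairs its input,
    it raises ET.ParseError, which is re-raised here as ValueError.
    """
    try:
        return ET.fromstring(markup)
    except ET.ParseError as exc:
        raise ValueError("malformed markup: %s" % exc) from exc


good = "<div><p>Hello</p><br/></div>"
bad = "<div><p>Hello</div>"  # <p> is never closed

root = parse_strict(good)
print(root.tag)  # div

try:
    parse_strict(bad)
except ValueError as exc:
    # The script can end processing here instead of silently "fixing" the input.
    print("stopped:", exc)
```

Whether XML-level strictness is too strict depends on the template format; it suits XHTML-style templates best.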

John La Rooy · Accepted Answer · 2011-10-31 03:00:46Z

I believe BeautifulStoneSoup can do this if you pass in a list of self-closing tags:

The most common shortcoming of BeautifulStoneSoup is that it doesn't know about self-closing tags. HTML has a fixed set of self-closing tags, but with XML it depends on what the DTD says. You can tell BeautifulStoneSoup that certain tags are self-closing by passing in their names as the selfClosingTags argument to the constructor:

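# BeautifulStoneSoup is part of Beautiful Soup 3 (the `BeautifulSoup` module).
from BeautifulSoup import BeautifulStoneSoup

xml = "<tag>Text 1<selfclosing>Text 2"

# Without a hint, <selfclosing> is treated as a container and closed at the end.
print(BeautifulStoneSoup(xml).prettify())
# <tag>
#  Text 1
#  <selfclosing>
#   Text 2
#  </selfclosing>
# </tag>

# With selfClosingTags, <selfclosing> becomes an empty element.
print(BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify())
# <tag>
#  Text 1
#  <selfclosing />
#  Text 2
# </tag>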
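Note that in both outputs above the missing closing tags are still added for you. If the goal is to reject such input instead, one option is to build a small checker on the standard library's `html.parser.HTMLParser`, which only reports tags as events and does not repair anything. The following is a rough sketch, not a full validator: the names `StrictTagChecker`, `MalformedHTML`, and `check_html` are hypothetical, and the rule it enforces (every non-void tag must be explicitly closed, in order) is an assumption that is stricter than HTML itself, which permits some end tags to be omitted.

```python
from html.parser import HTMLParser

# HTML's fixed set of self-closing (void) elements; they take no end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MalformedHTML(Exception):
    """Raised when tags are unbalanced (hypothetical exception name)."""


class StrictTagChecker(HTMLParser):
    """Tracks open tags on a stack and raises instead of repairing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # tolerate <br/>, which reports both a start and an end
        if not self.open_tags or self.open_tags[-1] != tag:
            line, _col = self.getpos()
            raise MalformedHTML("unexpected </%s> near line %d" % (tag, line))
        self.open_tags.pop()


def check_html(markup):
    """Raise MalformedHTML if tags are mismatched or left open."""
    checker = StrictTagChecker()
    checker.feed(markup)
    checker.close()
    if checker.open_tags:
        raise MalformedHTML("unclosed tags: " + ", ".join(checker.open_tags))


check_html("<div><p>Text 1<br>Text 2</p></div>")  # passes silently

try:
    check_html("<tag>Text 1<selfclosing>Text 2")
except MalformedHTML as exc:
    print("stopped:", exc)
```

A template converter could call `check_html` first and only hand the markup to a real parser once it passes. Custom self-closing tags (as in the `selfClosingTags` example) could be supported by adding their names to `VOID_TAGS`.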
