I need to parse HTML in Python, but I do not want the parsing library to attempt to "fix" the HTML. In my situation, if the HTML is malformed, my script needs to stop processing. I tried BeautifulSoup, but it fixed things that I did not want fixed. I'm creating a tool that parses template files and outputs them in another template style. Any suggestions on a tool or method to use?

  • Are you looking for some already created code or do you want to code your own parser? Commented Oct 31, 2011 at 2:50
  • I guess a good starting place would be what are you trying to fix? Commented Oct 31, 2011 at 2:50
  • He said he doesn't want to fix anything. Commented Oct 31, 2011 at 2:53

2 Answers

The book Foundations of Python Network Programming has a detailed comparison of scraping the same web page with Beautiful Soup and with the lxml library. In general, you will find that lxml is faster, more effective, and has an API that adheres closely to a Python standard (the ElementTree API, which comes with the Python Standard Library). See this blog post by the inimitable Ian Bicking for an idea of why you should look at lxml instead of the old-fashioned Beautiful Soup library for parsing HTML:

https://ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/

2 Comments

lxml's standard XML parser will raise an exception on malformed HTML, and its HTML parser will fix the errors anyway.
Yes, good point — always be sure to use lxml's forgiving HTML parser, never its standard XML parser, when trying to scrape a web page! :)
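
The strict-versus-forgiving distinction in these comments can be sketched with the standard library's ElementTree, the API that lxml follows. This is only a minimal sketch, assuming the templates are well-formed XML (XHTML-style); the helper name `parse_strict` is hypothetical, and ordinary HTML with unclosed void tags such as `<br>` would be rejected as well.

```python
import xml.etree.ElementTree as ET


def parse_strict(markup):
    """Parse markup with a strict XML parser (hypothetical helper).

    Returns the root element, or raises ValueError if the markup is
    not well-formed. Nothing is repaired along the way.
    """
    try:
        return ET.fromstring(markup)
    except ET.ParseError as exc:
        raise ValueError("malformed markup: %s" % exc)


if __name__ == "__main__":
    root = parse_strict("<div><p>ok</p></div>")
    print(root.tag)
    try:
        # The <p> element is never closed, so parsing stops here.
        parse_strict("<div><p>unclosed</div>")
    except ValueError as exc:
        print("stopped:", exc)
```

Because the parser raises instead of repairing, the calling script can end processing at the first error, which is the behavior the question asks for.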

I believe BeautifulStoneSoup can do this if you pass in a list of self-closing tags:

The most common shortcoming of BeautifulStoneSoup is that it doesn't know about self-closing tags. HTML has a fixed set of self-closing tags, but with XML it depends on what the DTD says. You can tell BeautifulStoneSoup that certain tags are self-closing by passing in their names as the selfClosingTags argument to the constructor:

from BeautifulSoup import BeautifulStoneSoup
xml = "<tag>Text 1<selfclosing>Text 2"
print BeautifulStoneSoup(xml).prettify()
# <tag>
#  Text 1
#  <selfclosing>
#   Text 2
#  </selfclosing>
# </tag>

print BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify()
# <tag>
#  Text 1
#  <selfclosing />
#  Text 2
# </tag>
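
For the original goal of stopping on malformed input rather than repairing it, the same idea of a fixed set of self-closing tags can be turned into a check. The following is a rough sketch using the standard library's `html.parser`, not a feature of BeautifulStoneSoup; the names `StrictTagChecker`, `MalformedHTML`, and `check_html` are hypothetical. It assumes that "malformed" means unbalanced tags only, so it is stricter than HTML itself, which allows some end tags (such as `</p>`) to be omitted.

```python
from html.parser import HTMLParser

# Void (self-closing) elements that never take an end tag in HTML.
VOID_TAGS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
])


class MalformedHTML(Exception):
    """Raised when tags are unbalanced (hypothetical exception)."""


class StrictTagChecker(HTMLParser):
    """Tracks open tags on a stack and raises on the first mismatch."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # also reached for <br/>, which needs no matching open tag
        if not self.open_tags or self.open_tags[-1] != tag:
            raise MalformedHTML("unexpected </%s>" % tag)
        self.open_tags.pop()


def check_html(markup):
    """Raise MalformedHTML if any tag is unclosed or closed out of order."""
    checker = StrictTagChecker()
    checker.feed(markup)
    checker.close()
    if checker.open_tags:
        raise MalformedHTML("unclosed <%s>" % checker.open_tags[-1])


if __name__ == "__main__":
    check_html("<p>Text 1<br>Text 2</p>")  # balanced, so no exception
    try:
        check_html("<tag>Text 1<selfclosing>Text 2")
    except MalformedHTML as exc:
        print("stopped:", exc)
```

A tag written as `<selfclosing/>` passes the check, because the parser reports it as a start tag immediately followed by its end tag.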
