20

I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM, such as missing closing tags or malformed tag attributes. Is there a way to clean up these errors natively in Python, or are there third-party modules I could install?

3
  • Were any of these answers what you were looking for? If you need more info we can certainly help. Commented Jun 20, 2010 at 21:17
  • @JudoWill: Yeah, I was able to get BeautifulSoup and Tidy set up. Unfortunately they weren't catching a lot of the issues I was having. I ended up building my own function to cycle through the DOM and fix the issues. Thanks for the help! Commented Jun 21, 2010 at 2:55
  • Could you post your own function as an answer. This is an issue that I have a lot of the time and I'm always looking for new solutions. :) Commented Jun 21, 2010 at 14:38
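
The custom function mentioned above was never posted, so the following is not that code. It is only a minimal sketch of the general idea using nothing but the standard library's html.parser: re-emit the markup, close tags that were left open, drop stray closing tags, and re-quote attributes. The names TagBalancer and balance_tags are hypothetical, and real-world HTML has many cases this does not handle.

```python
from html import escape
from html.parser import HTMLParser

# Void elements never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagBalancer(HTMLParser):
    """Re-emit HTML, closing unclosed tags and dropping stray end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the repaired document
        self.stack = []  # currently open (non-void) tags

    def handle_starttag(self, tag, attrs):
        # Re-serialize attributes with consistent double quoting.
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s>" % " ".join(parts))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag with no matching opener: drop it
        # Close any tags left open inside the one being closed.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Script and style bodies are passed through untouched.
        if self.stack and self.stack[-1] in ("script", "style"):
            self.out.append(data)
        else:
            self.out.append(escape(data, quote=False))

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def balance_tags(bad_html):
    """Return bad_html with every opened tag closed (hypothetical helper)."""
    parser = TagBalancer()
    parser.feed(bad_html)
    parser.close()
    # Close whatever is still open at the end of the input.
    while parser.stack:
        parser.out.append("</%s>" % parser.stack.pop())
    return "".join(parser.out)


print(balance_tags("<div><p>Hello <b>world</p></span>"))
print(balance_tags("<a href=foo.html class='x'>link"))
```

A dedicated parser such as the ones in the answers below will usually make better repair decisions; this only shows that simple cases can be handled natively.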

5 Answers

20

I would suggest BeautifulSoup. It has a wonderful parser that can deal with malformed tags quite gracefully. Once you've read in the entire tree, you can just output the result.

from bs4 import BeautifulSoup
# Name the parser explicitly; "html.parser" ships with Python
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()

I've used this many times and it works wonders. BeautifulSoup really shines when you simply need to pull data out of bad HTML.
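
To make the repair step concrete, here is a small sketch of the same approach. It assumes bs4 is installed and uses the standard-library "html.parser" backend; the sample markup and the helper name repair_html are made up for illustration.

```python
from bs4 import BeautifulSoup


def repair_html(bad_html):
    """Parse possibly malformed HTML and serialize the repaired tree."""
    # "html.parser" is the standard-library backend; "lxml" or "html5lib"
    # can be passed instead if installed, and may repair markup differently.
    soup = BeautifulSoup(bad_html, "html.parser")
    # str(soup) keeps the original whitespace; prettify() would re-indent.
    return str(soup)


# Hypothetical sample input: <b> and <p> are never closed.
print(repair_html("<p>Hello <b>world"))
```

Different backends can produce different trees for the same broken input, so it is worth checking the output against a few representative documents before settling on one.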

2 Comments

Be careful with performance: BeautifulSoup is very expensive.
@Tarantula: I agree, BeautifulSoup is pretty slow, but it's the only thing I've ever seen that can parse some of those crazy malformed HTML-based tables out there.
11

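An example of cleaning up HTML using the Cleaner class from lxml.html.clean.

Requires the lxml module: pip install lxml (it's a native module written in C, so it might be faster than pure-Python solutions). Note that since lxml 5.2 the cleaner lives in the separate lxml_html_clean package, which can be installed with pip install lxml_html_clean.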

import sys

from lxml.html.clean import Cleaner


def sanitize(dirty_html):
    # Strip scripts, styles, forms, frames, comments and similar markup,
    # and keep only a small whitelist of attributes.
    cleaner = Cleaner(
        page_structure=True,
        meta=True,
        embedded=True,
        links=True,
        style=True,
        processing_instructions=True,
        inline_style=True,
        scripts=True,
        javascript=True,
        comments=True,
        frames=True,
        forms=True,
        annoying_tags=True,
        remove_unknown_tags=True,
        safe_attrs_only=True,
        safe_attrs=frozenset(['src', 'color', 'href', 'title', 'class', 'name', 'id']),
        # remove_tags drops the tags themselves but keeps their content
        remove_tags=('span', 'font', 'div'),
    )
    return cleaner.clean_html(dirty_html)


if __name__ == '__main__':
    # Read the HTML file named on the command line and print the cleaned markup
    with open(sys.argv[1]) as fin:
        print(sanitize(fin.read()))

Check out the docs for a full list of options you can pass to the Cleaner.

2 Comments

How can it remove tags (e.g. div) with a specific 'id' or 'class' completely, including their text?
@triwo: this is not supported out of the box, but you can parse the markup and remove the nodes by class or id with lxml; e.g. see stackoverflow.com/questions/8226490
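
As a rough illustration of the suggestion in the comment above, the sketch below parses the markup with lxml.html and drops every element matching a class or id, together with its text. It is not taken from the linked question; the function name remove_nodes and the sample markup are made up for this example.

```python
import lxml.html

# Match one class token inside a space-separated class attribute.
CLASS_XPATH = ("//*[contains(concat(' ', normalize-space(@class), ' '), "
               "concat(' ', $cls, ' '))]")
ID_XPATH = "//*[@id=$ident]"


def remove_nodes(markup, class_name=None, element_id=None):
    """Drop elements with the given class or id, including their content."""
    root = lxml.html.fromstring(markup)
    doomed = []
    if class_name is not None:
        # XPath variables avoid building the query by string formatting.
        doomed += root.xpath(CLASS_XPATH, cls=class_name)
    if element_id is not None:
        doomed += root.xpath(ID_XPATH, ident=element_id)
    for element in doomed:
        # Skip the root (and anything already detached): it has no parent.
        if element.getparent() is not None:
            # drop_tree() removes the element, its children and its text,
            # while keeping any tail text that follows it.
            element.drop_tree()
    return lxml.html.tostring(root, encoding="unicode")


sample = ('<div><p>keep</p><div class="ad box">remove me</div>'
          '<p id="promo">also gone</p></div>')
print(remove_nodes(sample, class_name="ad", element_id="promo"))
```

The result can then be passed to a Cleaner if attribute and tag filtering is also needed.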
4

There are Python bindings for the HTML Tidy Library Project, but automatically cleaning up broken HTML is a tough nut to crack. It's not so different from trying to automatically fix source code: there are just too many possibilities. You'll still need to review the output and almost certainly make further fixes by hand.

3

I am using lxml to convert HTML to proper (well-formed) XML:

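from lxml import etree
tree = etree.HTML(input_text.replace('\r', ''))
# encoding="unicode" makes tostring() return str rather than bytes, which '\n'.join() needs on Python 3
output_text = '\n'.join(
    etree.tostring(stree, pretty_print=True, method="xml", encoding="unicode")
    for stree in tree)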

... with a lot of removal of 'dangerous elements' in between.

1

This can be done using the tidy_document function in the tidylib module (installed as the pytidylib package, which wraps the HTML Tidy library).

import tidylib

html = '<html>...</html>'
inputEncoding = 'utf8'
options = {
    "output-xhtml": True,  # or "output-xml": True
    "quiet": True,
    "show-errors": 0,
    "force-output": True,
    "numeric-entities": True,
    "show-warnings": False,
    "input-encoding": inputEncoding,
    "output-encoding": "utf8",
    "indent": False,
    "tidy-mark": False,
    "wrap": 0,
}
document, errors = tidylib.tidy_document(html, options=options)
