Clean Up HTML in Python

Question

I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM, for example missing closing tags or malformed tag attributes. Is there a way to clean up these errors natively in Python, or are there any third-party modules I could install?

Were any of these answers what you were looking for? If you need more info, we can certainly help.
– JudoWill, Commented Jun 20, 2010 at 21:17
@JudoWill: Yeah, I was able to get BeautifulSoup and Tidy set up. Unfortunately they weren't catching a lot of the issues I was having. I ended up building my own function to cycle through the DOM and fix the issues. Thanks for the help!
– Joel, Commented Jun 21, 2010 at 2:55
Could you post your own function as an answer? This is an issue that I have a lot of the time and I'm always looking for new solutions. :)
– JudoWill, Commented Jun 21, 2010 at 14:38
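
Joel's function was never posted in this thread. Purely as an illustration of what a hand-rolled, standard-library-only pass over the markup could look like, here is a minimal sketch built on html.parser. The names TagBalancer and balance_tags are hypothetical, and the repair rules (close whatever is still open, drop stray closing tags) are assumptions, not Joel's actual logic. It drops comments and doctypes and does not treat script or style content specially.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagBalancer(HTMLParser):
    """Hypothetical helper: re-emit markup with balanced tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the repaired markup
        self.open_tags = []  # stack of tags still waiting for a close

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: drop it
        # Close everything opened after the matching start tag, then the tag.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append(f"</{current}>")
            if current == tag:
                break

    def handle_data(self, data):
        # The parser hands over unescaped text, so escape it again on output.
        self.out.append(escape(data, quote=False))

    def close(self):
        super().close()  # flush any buffered text first
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")


def balance_tags(markup):
    parser = TagBalancer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

With this sketch, balance_tags("<div><p>Hello <b>world</div>") returns <div><p>Hello <b>world</b></p></div>. Real-world markup has many more failure modes than this covers, which is why the answers below lean on existing libraries.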

guerda · Accepted Answer · 2019-11-19 06:54:27Z

20

I would suggest BeautifulSoup. Its parser deals with malformed tags quite gracefully. Once you've read in the entire tree, you can just output the result.

from bs4 import BeautifulSoup

# Name the parser explicitly; "html.parser" ships with the standard library.
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()

I've used this many times and it works wonders. BeautifulSoup also really shines when all you need is to pull data out of bad HTML.

edited Nov 19, 2019 at 6:54

guerda


answered Jun 19, 2010 at 1:31

JudoWill


2 Comments

Tarantula Over a year ago

Be careful with performance: BeautifulSoup is very expensive.

JudoWill Over a year ago

@Tarantula: I agree, BeautifulSoup is pretty slow, but it's the only thing I've ever seen that can parse some of those crazy malformed HTML-based tables out there.
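
On the performance point: BeautifulSoup hands the actual parsing to a pluggable backend, and its documentation describes the lxml backend as faster than the standard library's html.parser. The following is a minimal sketch, assuming lxml may or may not be installed; repair_html is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def repair_html(bad_html, parsers=("lxml", "html.parser")):
    """Parse with the first available backend and serialize the repaired tree."""
    for name in parsers:
        try:
            tree = BeautifulSoup(bad_html, name)
        except FeatureNotFound:
            continue  # this backend is not installed; try the next one
        return str(tree)
    raise RuntimeError("none of the requested parsers is installed")
```

Calling repair_html("<p>Hello <b>world") closes both open tags. The exact output depends on the backend: lxml, for instance, also wraps fragments in html and body elements, so the same input can serialize differently from one parser to the next.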

ccpizza · Accepted Answer · 2021-09-08 19:33:59Z

11

An example of cleaning up HTML using the Cleaner class from the lxml.html.clean module.

Requires the lxml package (pip install lxml). It's a native module written in C, so it might be faster than pure-Python solutions.

import sys

from lxml.html.clean import Cleaner


def sanitize(dirty_html):
    cleaner = Cleaner(
        page_structure=True,
        meta=True,
        embedded=True,
        links=True,
        style=True,
        processing_instructions=True,
        inline_style=True,
        scripts=True,
        javascript=True,
        comments=True,
        frames=True,
        forms=True,
        annoying_tags=True,
        remove_unknown_tags=True,
        safe_attrs_only=True,
        safe_attrs=frozenset(['src', 'color', 'href', 'title', 'class', 'name', 'id']),
        remove_tags=('span', 'font', 'div'),
    )
    return cleaner.clean_html(dirty_html)


if __name__ == '__main__':
    with open(sys.argv[1]) as fin:
        print(sanitize(fin.read()))

Check out the docs for a full list of options you can pass to the Cleaner.
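
One caveat for newer installs: as far as I know, lxml 5.2 and later no longer bundle the cleaner, which moved to the separate lxml_html_clean package (pip install lxml_html_clean). A minimal sketch of an import that should work in both setups, with a smaller option set than the example above; strip_active_content is a hypothetical name:

```python
try:
    # Newer setups: the cleaner lives in its own package (assumed installed).
    from lxml_html_clean import Cleaner
except ImportError:
    # Older lxml releases still ship it as lxml.html.clean.
    from lxml.html.clean import Cleaner


def strip_active_content(dirty_html):
    """Remove scripts, inline JavaScript, styles and comments; keep the rest."""
    cleaner = Cleaner(scripts=True, javascript=True, comments=True, style=True)
    return cleaner.clean_html(dirty_html)
```

Because safe_attrs_only defaults to True, event-handler attributes such as onclick are dropped as well.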

edited Sep 8, 2021 at 19:33

answered Sep 22, 2017 at 18:36

ccpizza


2 Comments

Lexx Luxx Over a year ago

How can it remove tags (e.g. div) with a specific 'id' or 'class', completely, including their text?

ccpizza Over a year ago

@triwo: this is not supported out of the box, but you can parse the markup and remove the nodes by class or id with lxml; e.g. see stackoverflow.com/questions/8226490
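
A minimal sketch of that suggestion, assuming the markup is a fragment that lxml.html can parse. The helper names remove_matching and class_xpath are hypothetical; drop_tree() is the lxml.html method that removes an element together with its children and text while keeping the text that follows it.

```python
from lxml import html


def class_xpath(name):
    """XPath matching any element whose class list contains `name`."""
    return (
        '//*[contains(concat(" ", normalize-space(@class), " "), '
        f'" {name} ")]'
    )


def remove_matching(markup, xpath_expr):
    """Drop every element matched by xpath_expr, including its contents."""
    root = html.fromstring(markup)
    for node in root.xpath(xpath_expr):
        # The root element has no parent and cannot be dropped this way.
        if node.getparent() is not None:
            node.drop_tree()
    return html.tostring(root, encoding="unicode")
```

For example, remove_matching(markup, '//*[@id="ad"] | ' + class_xpath("promo")) removes the element with id "ad" and every element carrying the class "promo".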

Nicholas Knight · Accepted Answer · 2010-06-19 00:49:09Z

4

There are Python bindings for the HTML Tidy Library Project, but automatically cleaning up broken HTML is a tough nut to crack. It's not so different from trying to automatically fix source code: there are just too many possibilities. You'll still need to review the output and almost certainly make further fixes by hand.
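
One way to make that review step less painful is to diff the original markup against whatever the cleaner produced. A minimal sketch using only the standard library's difflib; review_changes is a hypothetical helper, and it works with the output of any cleaner, not just Tidy.

```python
import difflib


def review_changes(original, cleaned):
    """Return a unified diff of the markup before and after cleaning."""
    diff = difflib.unified_diff(
        original.splitlines(),
        cleaned.splitlines(),
        fromfile="original",
        tofile="cleaned",
        lineterm="",
    )
    return "\n".join(diff)
```

An empty result means the cleaner changed nothing. A line-based diff is most useful when both versions are pretty-printed the same way, otherwise whitespace changes tend to drown out the real fixes.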

answered Jun 19, 2010 at 0:49

Nicholas Knight


ondra · Accepted Answer · 2011-06-26 08:41:23Z

3

I am using lxml to convert HTML to proper (well-formed) XML:

from lxml import etree

tree = etree.HTML(input_text.replace('\r', ''))
# encoding="unicode" makes tostring() return str rather than bytes,
# so the pieces can be joined with '\n' on Python 3.
output_text = '\n'.join(
    etree.tostring(stree, pretty_print=True, method="xml", encoding="unicode")
    for stree in tree
)

... and doing a lot of removing of 'dangerous elements' in the middle....

answered Jun 26, 2011 at 8:41

ondra


c2o93y50 · Accepted Answer · 2015-03-22 09:03:30Z

1

This can be done using the tidy_document function in the tidylib module.

import tidylib

html = '<html>...</html>'
inputEncoding = 'utf8'
options = {
    "output-xhtml": True,  # or "output-xml": True
    "quiet": True,
    "show-errors": 0,
    "force-output": True,
    "numeric-entities": True,
    "show-warnings": False,
    "input-encoding": inputEncoding,
    "output-encoding": "utf8",
    "indent": False,
    "tidy-mark": False,
    "wrap": 0,
}
document, errors = tidylib.tidy_document(html, options=options)

answered Mar 22, 2015 at 9:03

c2o93y50
