Parsing HTML and JS in Python using lxml

Question

I'm having trouble stripping JavaScript from a page using lxml in Python. When I run the code below, the output is:

<Element div at 0x10cec4e10>

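import urllib2

import lxml.html
from lxml.html.clean import Cleaner

# configure the cleaner to strip <script> tags and other JavaScript
cleaner = Cleaner()
cleaner.javascript = True

# fetch the page, clean it, and parse the result
text = urllib2.urlopen("URL").read().decode("utf-8")
test = lxml.html.fromstring(cleaner.clean_html(text))
print test  # prints the element's repr, not its content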

What I want is the parsed text without the JavaScript. Can someone shed some light? Thanks.

Hugh Bothwell · Accepted Answer · 2014-03-01 03:31:16Z


import urllib2

# "import lxml" alone does not load these submodules
import lxml.etree
import lxml.html
import lxml.html.clean

URL = "http://www.google.com/"
ENCODING = "latin1"

args = {
    "javascript": True,         # strip javascript
    "page_structure": False,    # leave page structure alone
    "style": True               # remove CSS styling
}
cleaner = lxml.html.clean.Cleaner(**args)

# get the page source
html = urllib2.urlopen(URL).read().decode(ENCODING)
# clean it up
clean = cleaner.clean_html(html)

# print unformatted html dump
print(clean)

# print properly indented html
tree = lxml.html.fromstring(clean)
print(lxml.etree.tostring(tree, pretty_print=True))
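
As for why the question's code printed <Element div at 0x...>: lxml.html.fromstring() returns an element object, and printing it shows only its repr. Here is a minimal sketch of serializing the element back to markup, or pulling out just its text with text_content(). It assumes a small hard-coded snippet: the `clean` string below is a hypothetical stand-in for the cleaned page source, so nothing is fetched or cleaned here.

```python
import lxml.html

# Hypothetical stand-in for the output of cleaner.clean_html(html);
# assumed to be free of <script> and <style> already.
clean = "<div><h1>Title</h1><p>Some <b>bold</b> text.</p></div>"

tree = lxml.html.fromstring(clean)

# Printing the element itself only shows its repr (tag name and address).
print(repr(tree))

# Serialize the element back to an HTML string.
markup = lxml.html.tostring(tree, encoding="unicode")
print(markup)

# Or keep only the text nodes and drop all markup.
text = tree.text_content()
print(text)
```

If the goal is plain text rather than cleaned HTML, text_content() on the cleaned tree is probably the call you want. Note that it joins text nodes with no separator, so adjacent blocks run together.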

Note that pretty-printing works properly with lxml.etree.tostring() but poorly with lxml.html.tostring(), which adds line breaks but no indentation. Go figure.
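
To compare the two serializers yourself, something like the following should do. It is only a sketch on a hypothetical two-paragraph snippet, and the exact whitespace each call emits may differ between lxml versions. Passing encoding="unicode" makes both calls return text rather than bytes, which matters if you run this on Python 3.

```python
import lxml.etree
import lxml.html

# Hypothetical sample markup, chosen only to make indentation visible.
tree = lxml.html.fromstring("<div><p>one</p><p>two</p></div>")

# XML-style serializer from lxml.etree
indented = lxml.etree.tostring(tree, pretty_print=True, encoding="unicode")

# HTML-style serializer from lxml.html
html_out = lxml.html.tostring(tree, pretty_print=True, encoding="unicode")

print(indented)
print(html_out)
```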

edited Mar 1, 2014 at 3:31

answered Mar 1, 2014 at 3:22
