I'm having trouble stripping JavaScript from a page using lxml in Python. When I run the code below, the output is:

<Element div at 0x10cec4e10>

import urllib2
import lxml.html
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True

text = urllib2.urlopen("URL").read().decode("utf-8")
test = lxml.html.fromstring(cleaner.clean_html(text))
print test

What I'm trying to get is the parsed text without the js stuff. Can someone shed some light? Thanks.

1 Answer

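# a bare "import lxml" does not load the submodules used below
import lxml.etree
import lxml.html
import lxml.html.clean
import urllib2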

URL = "http://www.google.com/"
ENCODING = "latin1"

args = {
    "javascript": True,         # strip javascript
    "page_structure": False,    # leave page structure alone
    "style": True               # remove CSS styling
}
cleaner = lxml.html.clean.Cleaner(**args)

# get the page source
html = urllib2.urlopen(URL).read().decode(ENCODING)
# clean it up
clean = cleaner.clean_html(html)

# print unformatted html dump
print(clean)

# print properly indented html
tree = lxml.html.fromstring(clean)
print(lxml.etree.tostring(tree, pretty_print=True))

Note that pretty-printing works properly with lxml.etree.tostring(), but poorly with lxml.html.tostring(), which adds line breaks but no indentation - go figure.
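
As for the output in the question: `print test` shows `<Element div at ...>` because lxml.html.fromstring() returns an Element object rather than a string, so you have to serialize it or pull the text out yourself. If what you want is the visible text rather than cleaned HTML, here is a minimal sketch of one way to do it. It assumes Python 3 and uses an inline sample string instead of a downloaded page; SAMPLE_HTML and visible_text are made-up names for illustration. It drops the script and style elements by hand with drop_tree() instead of going through Cleaner, then calls text_content().

```python
import lxml.html

# Hypothetical inline sample standing in for a downloaded page.
SAMPLE_HTML = """
<html>
  <head><style>p { color: red; }</style></head>
  <body>
    <p>Hello, <b>world</b>!</p>
    <script>alert("hi");</script>
  </body>
</html>
"""


def visible_text(html):
    """Return the text of an HTML document with script/style content dropped."""
    tree = lxml.html.fromstring(html)
    # Remove <script> and <style> elements together with their contents.
    for element in tree.xpath("//script | //style"):
        element.drop_tree()
    # text_content() concatenates the remaining text nodes;
    # split/join collapses runs of whitespace into single spaces.
    return " ".join(tree.text_content().split())


text = visible_text(SAMPLE_HTML)
print(text)
```

In the code from the answer you could equally call text_content() on the tree built from the Cleaner output; the manual drop_tree() loop is only there so the sketch does not depend on the cleaner module.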
