When parsing HTML with Python and lxml, I would like to retrieve the actual attribute text for HTML elements, but instead I get the attribute text with entities resolved. That is, if the actual attribute reads this &amp; that, I get back this & that.

Is there a way to get the unresolved attribute value? Here is some example code that shows my problem, using Python 2.7 and lxml 3.2.1:

from lxml import etree
s = '<html><body><a alt="hi &amp; there">a link</a></body></html>'
parser = etree.HTMLParser()
tree = etree.fromstring(s, parser=parser)
anc = tree.xpath('//a')
a = anc[0]
a.get('alt')
# 'hi & there'

a.attrib.get('alt')
# 'hi & there'

etree.tostring(a)
# '<a alt="hi &amp; there">a link</a>'

I want to get the actual string hi &amp; there.

  • cgi.escape(a.attrib.get('alt')) Commented May 4, 2015 at 19:47
  • stackoverflow.com/questions/1061697/… Commented May 4, 2015 at 19:47
  • what I would like is a way to get the text unaltered by lxml; cgi.escape will escape by replacing ampersands with entities (for example), but even if it was unescape (replacing entities with ampersands), what I want is the actual text as it exists in the generally unknown HTML source. Commented May 4, 2015 at 20:15
  • You'll need to build a custom parser then. Perhaps you can inherit the HTMLParser and override the parsing of the textual bits you want. Commented May 5, 2015 at 8:05

1 Answer

An unescaped special character such as a bare & is invalid in HTML, and the HTML abstraction model (lxml.etree in this case) only works with valid HTML. So once the source HTML has been loaded into the object model, there is no longer any notion of an unescaped character.

Given unescaped characters in the HTML source, a parser will either fail completely or try to fix the source automatically. lxml.etree.HTMLParser seems to fall into the latter category. For a demo:

from lxml import etree
s = '<div>hi & there</div>'
parser = etree.HTMLParser()
t = etree.fromstring(s, parser=parser)
print(etree.tostring(t.xpath('//div')[0]))
#the source is automatically escaped. output:
#<div>hi &amp; there</div>

I also believe the HTML tree model doesn't retain any information about the original HTML source; it keeps only the repaired, valid version. So at this point, all we can see is that every special character comes back escaped when the tree is serialized.

Having said that, how about using cgi.escape() to get escaped entities! :p

#..continuing the demo code above:
import cgi
print(t.xpath('//div')[0].text)
#hi & there
print(cgi.escape(t.xpath('//div')[0].text))
#hi &amp; there
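
On Python 3 the same idea needs a different function: cgi.escape was deprecated in 3.2 and removed in 3.8, and html.escape is the usual replacement. Below is a minimal sketch, assuming the decoded value has already been read from the tree. The literal stands in for a.get('alt') from the question, and reescape_attr is just a made-up helper name. Note that this yields one canonical escaping, which is not necessarily the spelling used in the original source.

```python
import html


def reescape_attr(value):
    """Re-escape a decoded attribute value (hypothetical helper)."""
    # quote=True also escapes " and ', which matters inside attribute values.
    return html.escape(value, quote=True)


# Stand-in for the decoded value that a.get('alt') returned in the question.
decoded = 'hi & there'
print(reescape_attr(decoded))  # hi &amp; there

# The round trip is lossy: a numeric reference in the source decodes to the
# same character, so re-escaping cannot tell which spelling was used.
source_spelling = 'hi &#38; there'
round_tripped = reescape_attr(html.unescape(source_spelling))
print(round_tripped == source_spelling)  # False
```

So this works when any correctly escaped form is acceptable, but not when the exact bytes of the source are required.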

2 Comments

har07 and @Will, thanks--I did not know that the restriction on unescaped chars applied to attributes as well as content. I see what you're both saying and I will rethink my original problem. cgi.escape seems like the only way to answer my question.
You can still build your own parser. Just inherit the standard one and overload the methods you need with some cgi.escape voodoo.
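
The "build your own parser" suggestion could be sketched roughly like this. It is only an illustration under some assumptions: it uses the standard library's html.parser (Python 3) rather than lxml, RawAttrParser is an invented name, and the regular expression only handles quoted attribute values. The idea is that get_starttag_text() returns the start tag exactly as written in the source, so the raw attribute text can be cut out of it next to the decoded value.

```python
import re
from html.parser import HTMLParser


class RawAttrParser(HTMLParser):
    """Collect (decoded, raw) pairs for one attribute (hypothetical helper)."""

    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name.lower()  # html.parser lowercases names
        self.found = []
        # Matches name="value" or name='value'; unquoted values are ignored.
        self._pattern = re.compile(
            r'(?<![\w-])' + re.escape(self.attr_name)
            + r'''\s*=\s*(["'])(.*?)\1''',
            re.IGNORECASE | re.DOTALL,
        )

    def handle_starttag(self, tag, attrs):
        decoded = dict(attrs).get(self.attr_name)
        if decoded is None:
            return
        # The start tag exactly as it appears in the source document.
        raw_tag = self.get_starttag_text()
        match = self._pattern.search(raw_tag)
        raw = match.group(2) if match else None
        self.found.append((decoded, raw))


parser = RawAttrParser('alt')
parser.feed('<html><body><a alt="hi &amp; there">a link</a></body></html>')
print(parser.found)  # [('hi & there', 'hi &amp; there')]
```

This trades lxml's error recovery and XPath support for access to the source text, so it probably only makes sense when the original spelling really matters.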
