How to get unresolved entities from HTML attributes using Python and lxml

Question

When parsing HTML with Python and lxml, I would like to retrieve the attribute text of HTML elements exactly as it appears in the source, but I get the text with entities resolved instead. That is, if the attribute in the source reads this &amp; that, I get back this & that.

Is there a way to get the unresolved attribute value? Here is some example code that shows my problem, using Python 2.7 and lxml 3.2.1:

from lxml import etree

s = '<html><body><a alt="hi &amp; there">a link</a></body></html>'
parser = etree.HTMLParser()
tree = etree.fromstring(s, parser=parser)
anc = tree.xpath('//a')
a = anc[0]

a.get('alt')
# 'hi & there'

a.attrib.get('alt')
# 'hi & there'

etree.tostring(a)
# '<a alt="hi &amp; there">a link</a>'

I want to get the actual string hi &amp; there.

What I would like is a way to get the text unaltered by lxml. cgi.escape will escape by replacing ampersands with entities (for example), but even if it were an unescape (replacing entities with ampersands), what I want is the actual text as it exists in the generally unknown HTML source.
– Tim, Commented May 4, 2015 at 20:15
You'll need to build a custom parser then. Perhaps you can inherit from HTMLParser and override the parsing of the textual bits you want.
– Will, Commented May 5, 2015 at 8:05

har07 · Accepted Answer · 2015-05-05 01:15:02Z

An unescaped character is invalid in HTML, and an HTML abstraction model (lxml.etree in this case) only works with valid HTML. So there is no notion of an unescaped character once the source HTML has been loaded into the object model.

Given unescaped characters in the HTML source, a parser will either fail completely or try to fix the source automatically. lxml.etree.HTMLParser seems to fall into the latter category. For example:

from lxml import etree

s = '<div>hi & there</div>'
parser = etree.HTMLParser()
t = etree.fromstring(s, parser=parser)
print(etree.tostring(t.xpath('//div')[0]))
# the bare ampersand is automatically escaped on serialization; output:
# <div>hi &amp; there</div>

And I believe the HTML tree model doesn't retain information about the original HTML source; it retains the fixed, valid version instead. So at this point, we can only see that all characters are escaped.
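To illustrate why the original spelling cannot be recovered once entities are resolved, here is a minimal sketch using only the standard library. It assumes Python 3 and uses html.unescape as a stand-in for the entity resolution a parser performs; it does not call lxml, and the variable names are only illustrative.

```python
from html import unescape

# Three different source spellings of the same attribute text.
spellings = ["hi &amp; there", "hi &#38; there", "hi & there"]

# Resolving entities maps every spelling to the same value, so the
# resolved value alone cannot tell us which spelling the source used.
resolved = {unescape(s) for s in spellings}
print(resolved)
```

All three spellings collapse to one string, which is why any re-escaping step can only produce a canonical form, not necessarily the text that was in the source.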

Having said that, how about using cgi.escape() to get escaped entities! :p

# ...continuing the demo code above:
import cgi

div = t.xpath('//div')[0]
print(div.text)
# hi & there
# escape the element's text (a string), not the element itself
print(cgi.escape(div.text))
# hi &amp; there
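The demo above uses cgi.escape, which matches the Python 2.7 setup in the question. On Python 3, cgi.escape is no longer available (it was removed in Python 3.8) and html.escape is the standard-library replacement. A minimal sketch, assuming Python 3; the sample strings are made up for illustration:

```python
from html import escape

# Element text: only &, < and > matter here.
text = "hi & there"
escaped_text = escape(text, quote=False)
print(escaped_text)

# Attribute values: quote=True (the default for html.escape) also
# escapes double and single quotes, which is what attributes need.
attr = 'say "hi" & wave'
escaped_attr = escape(attr, quote=True)
print(escaped_attr)
```

Note that this yields a canonical escaped form of the resolved value, which may differ from what the original source contained.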


2 Comments

Tim Over a year ago

har07 and @Will, thanks. I did not know that the restriction on unescaped characters applied to attributes as well as content. I see what you're both saying and I will rethink my original problem. cgi.escape seems like the only way to answer my question.

Will Over a year ago

You can still build your own parser. Just inherit from the standard one and override the methods you need with some cgi.escape voodoo.
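One way the "build your own parser" idea could look is sketched below. This is not lxml and not necessarily what the commenter had in mind: it assumes Python 3 and subclasses the standard library's html.parser.HTMLParser, whose get_starttag_text() method returns the start tag as written in the source. The class name RawAttrParser, the raw_attrs list, and the regular expression are hypothetical helpers, and the regex only handles quoted attribute values.

```python
import re
from html.parser import HTMLParser


class RawAttrParser(HTMLParser):
    """Hypothetical helper: records attribute values as written in the source."""

    # Matches name="value" or name='value'; unquoted values are not handled.
    _ATTR_RE = re.compile(r"""([^\s=<>"'/]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")

    def __init__(self):
        super().__init__()
        self.raw_attrs = []  # list of (tag, {attribute name: raw value})

    def handle_starttag(self, tag, attrs):
        # `attrs` holds entity-resolved values; the raw tag text does not.
        raw_tag = self.get_starttag_text()
        found = {}
        for m in self._ATTR_RE.finditer(raw_tag):
            value = m.group(2) if m.group(2) is not None else m.group(3)
            found[m.group(1).lower()] = value
        self.raw_attrs.append((tag, found))


parser = RawAttrParser()
parser.feed('<html><body><a alt="hi &amp; there">a link</a></body></html>')
parser.close()

# Look up the raw alt text of the <a> element.
raw_alt = dict(parser.raw_attrs)["a"]["alt"]
print(raw_alt)
```

Because the value is sliced out of the source text rather than read from a tree, it keeps whatever spelling the author used. The trade-off is that a regex over the tag text is far less robust than a real HTML parser on malformed markup.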
