6

I'm trying to make a div element from the below string with html entities. Since my string contains html entities, & reserved char in the html entity is being escaped as & in the output. Thus html entities are displayed as plain text. How can I avoid this so html entities are rendered properly?

s = 'Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources'

div = etree.Element("div")
div.text = s

lxml.html.tostring(div)

output:
<div>Actress Adamari L&amp;#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&amp;#8482; Website And Resources</div>

1 Answer 1

5

You can specify encoding while calling tostring():

>>> from lxml.html import fromstring, tostring
>>> s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'
>>> div = fromstring(s)
>>> print tostring(div, encoding='unicode')
<p>Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources</p>

As a side note, you should definitely use lxml.html.tostring() while dealing with HTML data:

Note that you should use lxml.html.tostring and not lxml.tostring. lxml.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers.

Also see:

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.