UTF-8 Encoding with lxml.html

Question

How can I get the correctly encoded text of a Japanese synopsis from Google Play? Here is what I have so far:

import requests
from lxml import html
res=requests.get('https://play.google.com/store/tv/show?id=bgJpf84fT4Q')
node=html.fromstring(res.content)
print node.xpath('//div[@itemprop="description"]')[0].text

How can I set utf-8 encoding on the text property?

@Kupiakos I just find it a bit easier to parse the xpath from lxml. This is the first time I've run into this encoding issue with non-Latin characters.
– David542, Commented Aug 4, 2015 at 0:02

Robᵩ · Accepted Answer · 2015-08-04 00:17:06Z

1

First, use res.text, not res.content. The former has already been decoded to unicode by requests. The latter is the raw, not-yet-decoded byte string (a str in Python 2).

node=html.fromstring(res.text)
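To illustrate why the decoded form matters, here is a minimal sketch that needs no network access. The name raw_body is a hypothetical stand-in for res.content, and the markup and Japanese text are made-up sample data, not what Google Play returns.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for res.content: the response body as UTF-8 bytes.
raw_body = u'<div itemprop="description">日本語のあらすじ</div>'.encode('utf-8')

# Decoding with the right codec recovers the original text. This is what
# res.text gives you when requests identifies the charset correctly.
decoded = raw_body.decode('utf-8')

# Decoding the same bytes with the wrong codec never fails for latin-1,
# but every multi-byte character turns into several garbage characters.
misdecoded = raw_body.decode('latin-1')

print(decoded == misdecoded)  # False
```

If the charset that requests guesses is wrong, res.encoding can be set before reading res.text.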

Second, there isn't a <div itemprop="description"> on that page. The only itemprop="description" I could find is in a <meta>, not a <div>, as revealed by:

print [n.tag for n in node.xpath('//*[@itemprop="description"]')]
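If the description does live in a <meta>, its value sits in the content attribute rather than in .text. The following is a rough sketch against a made-up inline document, written with print() so it also runs on Python 3. The sample markup is an assumption for illustration only, not the actual Google Play page.

```python
# -*- coding: utf-8 -*-
from lxml import html

# Hypothetical sample page: the description is carried by a <meta> tag.
sample = u'''<html><head>
<meta itemprop="description" content="日本語のあらすじ">
</head><body><p>placeholder</p></body></html>'''

# Parse already-decoded text, as with res.text.
node = html.fromstring(sample)

# Match any element carrying the attribute, whatever its tag.
matches = node.xpath('//*[@itemprop="description"]')
tags = [n.tag for n in matches]

# A <meta> element has no text; read its content attribute instead.
descriptions = [n.get('content') for n in matches if n.tag == 'meta']

print(tags)
print(descriptions)
```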

edited Aug 4, 2015 at 0:17

answered Aug 4, 2015 at 0:03

4 Comments

David542 Over a year ago

Thank you. Here is what I'm seeing: <div class="show-more-content text-body" itemprop="description"> Arman has given up all his pleasures and vices: smoking, drinking and fast food and he's gone through an initial extreme combat arts training period of six months. Now Arman travels to 10 exotic countries around the world, including Japan, China, USA, Cambodia and Malaysia -- each a birthplace of a different martial art! He has to learn vital combat skills, extreme discipline and extreme suffering! ...s? <div class="show-more-end"></div> </div>

David542 Over a year ago

Also, what's the difference between res.text and res.url?

Robᵩ Over a year ago

I see the <div class="show-more-content text-body" itemprop="description"> when I use Chrome's dev console, but not when I use requests. I wonder if that div is synthesized by JavaScript, or perhaps different data is returned based on the User-Agent.

David542 Over a year ago

got it, thank you for the clarification in the updated answer.
