How can I get correctly decoded text when scraping a Japanese synopsis from Google Play? Here is what I have so far:

import requests
from lxml import html
res=requests.get('https://play.google.com/store/tv/show?id=bgJpf84fT4Q')
node=html.fromstring(res.content)
print node.xpath('//div[@itemprop="description"]')[0].text

æ¥æ¬ã®ã©ããã«å­å¨ããå¶æªãªç¯ç½ªãå¤çºããç¡æ³å°å¸¯ãéç§°ãæ··æ²è¡ï¼ã«ã¼ãªã¹ã¿ã¦ã³ï¼ããè­¦å¯ããè¦æ¾ãããã®è¡ã«ãç¯ç½ªèããæããããç¾èã®å¥³æ®ºãå±ãã¡ãå­å¨ãã...ãã®åããã¢ã·ãã¬ãï¼ã¢ãã«ã¬ã¼ã«ãºï¼ããã­ã£ãã¬ã¼ã»ã¢ã·ãã¬ãããã¯ã表åãã¯ç¾èã®è¸ãå­ãã¡ãéãéå ´ã ããè£ã®é¡ã¯æªã¸ã®å¾©è®ãæãèãã¢ã·ãã¬ã«æ®ºããä¾é ¼ããå ´æãå¼·ãçµæåãããããã§ããªããã°ãä¾é ¼äººã«ææç§»å¥ããããã§ããªããéããç©ã¾ããã°ã©ããªç¸æã§ããèªæ¢ã®ç¾èã§ã¯ã¼ã«ï¼ã»ã¯ã·ã¼ã«ãããã¦å¿ãä»çããã®ã ...ãã­ã£ãã¬ã¼ã®çµå¶èã»ã¿ã³ã½ã¯ãã娼婦ã»ãã³ã½ã¯ãã¯ããã¨ããåæ§çãªã­ã£ã©ã¯ã¿ã¼ã¨ã¨ãã«ã仿¥ãã

How can I set utf-8 encoding on the text property?

  • If you're using requests, why not use BeautifulSoup? Commented Aug 3, 2015 at 23:54
  • @Kupiakos I just find it a bit easier to parse the xpath from lxml. This is the first time I've run into this encoding issue with non-latin characters. Commented Aug 4, 2015 at 0:02

1 Answer

First, use res.text, not res.content. The former is an already-decoded unicode string. The latter is the raw, not-yet-decoded byte str, which leaves the parser to guess the encoding.

node=html.fromstring(res.text)
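
To see why the raw bytes come out garbled, here is a minimal sketch that needs no network access. It assumes the page is served as UTF-8 and that the garbling comes from those bytes being read as Latin-1. The sample string is illustrative, not fetched from Google Play.

```python
# -*- coding: utf-8 -*-
# Illustrative sample text (an assumption, not data fetched from the page).
original = u"日本のどこかに存在する"

# What res.content would hold if the server sends UTF-8: raw bytes.
raw_bytes = original.encode("utf-8")

# Reading those bytes with the wrong codec gives garbled text
# of the kind shown in the question.
garbled = raw_bytes.decode("latin-1")

# Decoding with the right codec restores the text, which is what
# res.text is meant to hand you.
decoded = raw_bytes.decode("utf-8")

print(decoded == original)
print(garbled == original)
```

If res.text still looks garbled, requests may have picked the wrong charset from the response headers. Setting res.encoding = 'utf-8' before reading res.text is one thing to try.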

Second, there isn't a <div itemprop="description"> on that page. The only itemprop="description" I could find is in a <meta>, not a <div>, as revealed by:

print [n.tag for n in node.xpath('//*[@itemprop="description"]')]
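
If the description really is in a <meta> element, its value sits in the content attribute rather than in .text. A minimal sketch, assuming markup like the hypothetical sample_page below (the real page may differ):

```python
# -*- coding: utf-8 -*-
from lxml import html

# Hypothetical stand-in for res.text; the real page markup may differ.
sample_page = u"""<html><head>
<meta itemprop="description" content="日本のどこかに存在する">
</head><body><p>placeholder</p></body></html>"""

node = html.fromstring(sample_page)
matches = node.xpath('//*[@itemprop="description"]')

# Which elements carry the attribute?
tags = [n.tag for n in matches]

# A <meta> element has no text node, so .text is None;
# the value lives in the content attribute instead.
description = matches[0].get("content")

print(tags)
print(matches[0].text is None)
```

Because sample_page is already a unicode string, lxml does no decoding of its own here and description comes back as unicode.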

4 Comments

Thank you. Here is what I'm seeing: <div class="show-more-content text-body" itemprop="description"> Arman has given up all his pleasures and vices: smoking, drinking and fast food and he's gone through an initial extreme combat arts training period of six months. Now Arman travels to 10 exotic countries around the world, including Japan, China, USA, Cambodia and Malaysia -- each a birthplace of a different martial art! He has to learn vital combat skills, extreme discipline and extreme suffering! ...s? <div class="show-more-end"></div> </div>
Also, what's the difference between res.text and res.url?
I see the <div class="show-more-content text-body" itemprop="description"> when I use Chrome's dev console, but not when I use requests. I wonder if that div is synthesized by the Javascript, or perhaps different data is returned based on user-agent.
got it, thank you for the clarification in the updated answer.
