How to get a Unicode string when extracting data in Python?

Question

I am trying to extract text from a Vietnamese website whose charset is UTF-8. However, the text I get always seems to be in ASCII, and I can't find a way to convert it to Unicode or to get exactly the text shown on the website. As a result, I can't save it to a file as expected.
I know this is a very common problem with Unicode in Python, but I still hope someone can help me figure it out.
My code:

import requests, re, io
import simplejson as json
from lxml import html, etree

base = "http://www.amthuc365.vn/cong-thuc/"
page = requests.get(base + "trang-" + str(1) + ".html")
pageTree = html.fromstring(page.text)

links = pageTree.xpath('//ul[contains(@class, "mt30")]/li/a/@href')
names = pageTree.xpath('//h3[@class="title"]/a/text()')
for name in names[:1]:
    print name
    # LÃ m bÃ¡nh oreo nhÃ¢n bÆ¡ Äáºu phá»ng thÆ¡m bÃ¹i

But what I need is "Làm bánh oreo nhân bơ đậu phộng thơm bùi".
Thanks.

Community · Accepted Answer · 2017-05-23 10:27:02Z


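Just switching from page.text to page.content should make it work, i.e. pass page.content to html.fromstring().

Explanation: page.text is the response body already decoded to a string using the encoding that requests inferred from the HTTP headers. When the Content-Type header names no charset, requests falls back to ISO-8859-1 for text responses, which is most likely what happens here, so the UTF-8 bytes are decoded with the wrong codec and come out garbled. page.content is the raw, undecoded bytes, which lets lxml work out the encoding from the document itself.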
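
The effect can be reproduced without any network access. The following is only a minimal sketch, assuming the server sends UTF-8 bytes and that the wrong decoding is ISO-8859-1; the names expected, raw_bytes, wrong_text, right_text, repaired and out_path are hypothetical and chosen just for this illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# The title as it appears on the website.
expected = u"Làm bánh oreo nhân bơ đậu phộng thơm bùi"

# Assumption: these are the bytes the server sends,
# i.e. roughly what page.content would hold for this title.
raw_bytes = expected.encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 never raises, but each multi-byte
# character is split into several wrong characters (the garbled output).
wrong_text = raw_bytes.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 gives the original text back.
right_text = raw_bytes.decode("utf-8")

# Text that is already garbled this way can be repaired by undoing the
# wrong decoding step and decoding again with the right codec.
repaired = wrong_text.encode("iso-8859-1").decode("utf-8")

# Save the text with an explicit encoding, then read it back.
out_path = os.path.join(tempfile.mkdtemp(), "names.txt")
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(right_text + u"\n")
with io.open(out_path, "r", encoding="utf-8") as f:
    saved = f.read().strip()

print(right_text == expected, repaired == expected, saved == expected)
```

In the question's code the corresponding change would be html.fromstring(page.content) instead of html.fromstring(page.text). Setting page.encoding = "utf-8" before reading page.text should also work, since requests uses that attribute when decoding the body.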


answered Sep 20, 2015 at 2:15 by alecxe
edited May 23, 2017 at 10:27 by Community Bot


1 Comment

Huy Do (over a year ago):

Thank you very much @alecxe
