3

I am trying to extract text from a Vietnamese website, which charset is in utf-8. However, the text I got is always in Ascii, and I can't find a way to convert them to unicode or get exactly the text on the website. As a result, I can't save them into file as expected.
I know this is the very popular problem with unicode in Python, but I still hope someone will help me to figure it out. Thanks.
My code:

import requests, re, io
import simplejson as json
from lxml import html, etree

base = "http://www.amthuc365.vn/cong-thuc/"
page = requests.get(base + "trang-" + str(1) + ".html")
pageTree = html.fromstring(page.text)

links = pageTree.xpath('//ul[contains(@class, "mt30")]/li/a/@href')
names = pageTree.xpath('//h3[@class="title"]/a/text()')
for name in names[:1]:
    print name
    # Làm bánh oreo nhân bÆ¡ Äậu phá»ng thÆ¡m bùi

but what I need is "Làm bánh oreo nhân bơ đậu phộng thơm bùi"
Thanks.

1 Answer 1

2

Just switching from page.text to page.content should make it work.

Explanation here.

Also see:

Sign up to request clarification or add additional context in comments.

1 Comment

Thank you very much @alecxe

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.