0

I'm trying to use urllib2 in Python 2.7 to fetch a page from the web. The page is encoded in UTF-8 and contains Greek characters. When I fetch and print it with the code below, I get gibberish instead of the Greek characters.

import urllib2
print urllib2.urlopen("http://www.pamestihima.gr").read()

The result is the same both in Netbeans 6.9.1 and in Windows 7 CLI.

I'm doing something wrong, but what?

3
  • Your Python code prints the Greek characters correctly for me. Commented Nov 16, 2010 at 15:25
  • 2
    Your console is not set to print Unicode (probably not set to handle UTF-8). Search on "Python printing Unicode Characters" since that's your real problem. Commented Nov 16, 2010 at 15:39
  • print urllib2.urlopen("pamestihima.gr").read().encode("utf-8") Commented Nov 16, 2010 at 16:12

2 Answers

3
  1. Unicode is not UTF-8. UTF-8 is a character encoding, like ISO-8859-1 or ASCII.

  2. Always decode your data as soon as possible to turn it into real Unicode: 'somestring in utf-8'.decode('utf-8') == u'somestring in utf-8'. Unicode objects are written u'', not ''.

  3. When data leaves your app, always encode it in the proper encoding. For web stuff this is mostly UTF-8. For console stuff this is whatever your console encoding is. On Windows, the console encoding is not UTF-8 by default.
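
Steps 2 and 3 can be sketched roughly like this. It is only an illustration: `recode_for_console` is a hypothetical helper (not part of urllib2), the sample bytes stand in for a fetched page, and it assumes the page really is UTF-8. It avoids `print` so it runs unchanged on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import sys


def recode_for_console(raw_bytes, source_encoding="utf-8", console_encoding=None):
    """Decode fetched bytes to Unicode, then encode them for the console.

    Hypothetical helper for illustration; not part of urllib2.
    """
    # Step 2: decode as early as possible to get a real Unicode object.
    text = raw_bytes.decode(source_encoding)
    # Step 3: encode on the way out, using the console's own encoding.
    # sys.stdout.encoding can be None (e.g. when output is piped),
    # so fall back to ASCII in that case.
    target = console_encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # "replace" substitutes '?' for characters the target encoding cannot
    # represent, instead of raising UnicodeEncodeError.
    return text.encode(target, "replace")


# UTF-8 bytes for the Greek letters alpha, beta, gamma (stand-in for a page).
sample = b"\xce\xb1\xce\xb2\xce\xb3"
greek_bytes = recode_for_console(sample, console_encoding="iso-8859-7")
ascii_bytes = recode_for_console(sample, console_encoding="ascii")
```

With urllib2, the `raw_bytes` argument would be the result of `urllib2.urlopen(url).read()`. Leaving `console_encoding` unset makes the helper use whatever `sys.stdout.encoding` reports, which is the part that differs between a Windows console and a UTF-8 Linux terminal.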

1

It prints correctly for me, too.

Check the character encoding of the program in which you are viewing the HTML source code. For example, in a Linux terminal, look for the "Set Character Encoding" option and make sure it is set to UTF-8.
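
From the Python side, you can also ask what encodings the interpreter thinks are in play. This is a small diagnostic sketch; `describe_output_encodings` is a made-up name, and the values it returns depend entirely on your machine and on how the script is launched.

```python
import locale
import sys


def describe_output_encodings():
    """Collect the encodings Python reports for this process."""
    return {
        # Encoding of the console attached to stdout; may be None when piped.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding the locale settings suggest for text data.
        "preferred": locale.getpreferredencoding(),
        # Encoding used for implicit str/unicode conversions.
        "default": sys.getdefaultencoding(),
    }


encodings = describe_output_encodings()
```

If the "stdout" entry is not UTF-8 (on a Windows console it is often a legacy code page), printing raw UTF-8 bytes will show up as gibberish even though the fetched data is fine.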
