
I have only been using Python for a few days, and Unicode seems to be a problem.

I have a text file that stores a string like this:

'\u0110\xe8n \u0111\u1ecf n\xfat giao th\xf4ng Ng\xe3 t\u01b0 L\xe1ng H\u1ea1'

I can read the file and print the string, but it displays incorrectly. How can I print it to the screen correctly, as follows:

"Đèn đỏ nút giao thông Ngã tư Láng Hạ"

Thanks in advance

    By "print the string", do you mean to a console? If so, it's probably your console that's the problem - are you sure it supports Unicode characters? Commented May 18, 2010 at 8:44

3 Answers

>>> x=r'\u0110\xe8n \u0111\u1ecf n\xfat giao th\xf4ng Ng\xe3 t\u01b0 L\xe1ng H\u1ea1'
>>> u=unicode(x, 'unicode-escape')
>>> print u
Đèn đỏ nút giao thông Ngã tư Láng Hạ

This works on a Mac, where Terminal.app correctly sets sys.stdout.encoding to utf-8. If your platform doesn't set that attribute correctly (or at all), you'll need to replace the last line with

print u.encode('utf8')

or whatever other encoding your terminal/console is using.

Note that in the first line I assign a raw string literal so that the "escape sequences" would not be expanded -- that just mimics what would happen if bytestring x was being read from a (text or binary) file with that literal content.
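
For readers on Python 3, where unicode() and the print statement no longer exist, the same idea can be sketched roughly as follows. This is a minimal sketch rather than the code from this answer: the helper names read_escaped_text and encode_for_console are made up for illustration, and it assumes the file contains only ASCII escape text like the sample in the question.

```python
import os
import sys
import tempfile

# The sample from the question, as the raw bytes a file would contain.
SAMPLE = rb'\u0110\xe8n \u0111\u1ecf n\xfat giao th\xf4ng Ng\xe3 t\u01b0 L\xe1ng H\u1ea1'


def read_escaped_text(path):
    # Hypothetical helper: read the file as bytes, then expand the
    # backslash escapes into real characters.
    with open(path, 'rb') as f:
        raw = f.read()
    return raw.decode('unicode_escape')


def encode_for_console(text, encoding=None):
    # Hypothetical helper: use the given encoding, else whatever
    # sys.stdout reports, else fall back to UTF-8 (an assumption).
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding)


# Write the sample to a temporary file so the sketch is self-contained.
fd, demo_path = tempfile.mkstemp()
os.write(fd, SAMPLE)
os.close(fd)

decoded = read_escaped_text(demo_path)
os.remove(demo_path)

# Explicitly encoded bytes, for a console known to expect UTF-8.
utf8_bytes = encode_for_console(decoded, 'utf-8')
```

In Python 3, print(decoded) encodes with sys.stdout.encoding on its own, so the explicit encoding step matters mainly when writing bytes to a file or to sys.stdout.buffer.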


It helps to show a simple example, with code and output, of what you have explicitly tried. My guess is that your console doesn't support Vietnamese. Here are some options:

# A byte string with Unicode escapes as text.
>>> x='\u0110\xe8n \u0111\u1ecf n\xfat giao th\xf4ng Ng\xe3 t\u01b0 L\xe1ng H\u1ea1'

# Convert to Unicode string.
>>> x=x.decode('unicode-escape')
>>> x
u'\u0110\xe8n \u0111\u1ecf n\xfat giao th\xf4ng Ng\xe3 t\u01b0 L\xe1ng H\u1ea1'

# Try to print to my console:
>>> print x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\dev\python\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0110' in position 0:
  character maps to <undefined>

# My console's encoding is cp437.
# Instead of the default strict error handling that throws exceptions, try:
>>> print x.encode('cp437','replace')
?èn ?? nút giao thông Ng? t? Láng H?    

# Six characters weren't supported.
# Here's a way to write the text to a temp file and display it with another
# program that supports the UTF-8 encoding:
>>> import tempfile
>>> f,name=tempfile.mkstemp()
>>> import os
>>> os.write(f,x.encode('utf8'))
48
>>> os.close(f)
>>> os.system('notepad.exe '+name)

Hope that helps you.
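
The 'replace' fallback used in the session above carries over to Python 3 almost unchanged. A minimal sketch, assuming a cp437 console as in this answer; console_safe is a hypothetical helper name, not a standard library function.

```python
def console_safe(text, encoding, errors='replace'):
    # Encode with a lenient error handler, then decode again so the result
    # is a str holding only characters the target encoding can represent.
    return text.encode(encoding, errors).decode(encoding)


text = '\u0110\xe8n \u0111\u1ecf n\xfat giao th\xf4ng Ng\xe3 t\u01b0 L\xe1ng H\u1ea1'

# Unsupported characters come back as '?'.
safe = console_safe(text, 'cp437')

# The distinct characters that cp437 cannot represent.
unsupported = sorted({ch for ch in text if console_safe(ch, 'cp437') != ch})
```

Other error handlers, such as 'backslashreplace' or 'xmlcharrefreplace', preserve which character was dropped instead of collapsing everything to '?'.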


Try this

>>> s=u"\u0110\xe8n \u0111\u1ecf n\xfat giao th\xf4ng Ng\xe3 t\u01b0 L\xe1ng H\u1ea1"
>>> print s
=> Đèn đỏ nút giao thông Ngã tư Láng Hạ
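
Note that a literal like this has its escapes expanded by Python when the source is compiled, whereas text read from a file still contains the backslashes as ordinary characters. A small Python 3 sketch of the difference (the variable names are illustrative only):

```python
# Escapes in a normal literal are expanded when the source is compiled.
literal = '\u0110\xe8n \u0111\u1ecf n\xfat giao th\xf4ng Ng\xe3 t\u01b0 L\xe1ng H\u1ea1'

# A raw literal keeps the backslashes, as text read from a file would.
from_file = r'\u0110\xe8n \u0111\u1ecf n\xfat giao th\xf4ng Ng\xe3 t\u01b0 L\xe1ng H\u1ea1'

# The file-style text needs an explicit decoding step to match the literal.
expanded = from_file.encode('ascii').decode('unicode_escape')
```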
