Python decode in unicode variable with non-ascii character or without

Question

A simple example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import traceback

e_u = u'abc'
c_u = u'中国'

print sys.getdefaultencoding()
try:
    print e_u.decode('utf-8')
    print c_u.decode('utf-8')
except Exception as e:
    print traceback.format_exc()

reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
try:
    print e_u.decode('utf-8')
    print c_u.decode('utf-8')
except Exception as e:
    print traceback.format_exc()

output:

ascii
abc
Traceback (most recent call last):
  File "test_codec.py", line 15, in <module>
    print c_u.decode('utf-8')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

utf-8
abc
中国

A few problems have troubled me for several days while trying to thoroughly understand codecs in Python, and I want to make sure my understanding is right:

Under the ascii default encoding, u'abc'.decode('utf-8') raises no error, but u'中国'.decode('utf-8') does.

I think that when evaluating u'中国'.decode('utf-8'), Python checks and finds that u'中国' is unicode, so it first tries u'中国'.encode(sys.getdefaultencoding()). That is the step that fails, which is why the exception is a UnicodeEncodeError rather than a decode error.

But u'abc' has the same code points as 'abc' (all < 128), so there is no error.

In Python 2.x, how does Python store a variable's value internally? If all characters in a string are < 128, is it treated as ascii, and otherwise as utf-8?

In [4]: chardet.detect('abc')
Out[4]: {'confidence': 1.0, 'encoding': 'ascii'}

In [5]: chardet.detect('abc中国')
Out[5]: {'confidence': 0.7525, 'encoding': 'utf-8'}

In [6]: chardet.detect('中国')
Out[6]: {'confidence': 0.7525, 'encoding': 'utf-8'}

4 revs · Accepted Answer · 2015-01-22 18:58:20Z

1

Short answer

You have to use encode(), or leave it out. Don't use decode() on unicode strings; that makes no sense. Also, sys.getdefaultencoding() doesn't help here in any way.

Long answer, part 1: How to do it correctly?

If you define:

c_u = u'中国'

then c_u is already a unicode string. That is, the Python interpreter has already decoded it from the byte string in your source file to a unicode string, using your -*- coding: utf-8 -*- declaration.

If you execute:

print c_u.encode('utf-8')

your string will be encoded back to UTF-8 and that byte string is sent to the standard output. Note that print usually performs this encoding for you (using the encoding of your standard output), so you can simplify this to:

print c_u
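The difference between the unicode string and its encoded bytes can be made visible with an explicit round trip. This is a minimal sketch, assuming UTF-8 as the target encoding; the names raw and back are illustrative and do not come from the question. It is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
c_u = u'中国'                 # text: two Unicode code points

raw = c_u.encode('utf-8')     # text -> bytes (three bytes per character here)
back = raw.decode('utf-8')    # bytes -> text

print(len(c_u))      # number of code points
print(len(raw))      # number of bytes
print(back == c_u)   # the round trip restores the original text
```

Only the byte string has an encoding; the unicode string is just a sequence of code points, which is why encode() is the operation that applies to it.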

Long answer, part 2: What's wrong with c_u.decode()?

If you execute c_u.decode(), Python will:

1. Try to convert your object (i.e. your unicode string) to a byte string
2. Try to decode that byte string to a unicode string

Note that this makes no sense if your object is a unicode string in the first place: you just convert it back and forth. But why does it fail? This is a quirk of Python 2: the first step (1.), like any implicit conversion from a unicode string to a byte string, usually uses sys.getdefaultencoding(), which in turn defaults to ASCII. u'abc' contains only ASCII characters, so step 1 succeeds by accident; u'中国' does not, so step 1 raises UnicodeEncodeError before any decoding happens. In other words,

c_u.decode()

translates roughly to:

c_u.encode(sys.getdefaultencoding()).decode()

which is why it fails.
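The two steps can be spelled out explicitly. The following is only a sketch that emulates the behaviour described above, not Python's actual implementation; the helper emulate_py2_unicode_decode and its default_encoding parameter are hypothetical names introduced for illustration. It should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
def emulate_py2_unicode_decode(text, encoding, default_encoding='ascii'):
    """Hypothetical helper: mimic unicode.decode() as two explicit steps."""
    raw = text.encode(default_encoding)   # step 1: implicit unicode -> bytes
    return raw.decode(encoding)           # step 2: the decode that was requested

# Pure ASCII text survives step 1, so the call appears to work.
print(emulate_py2_unicode_decode(u'abc', 'utf-8'))

failure = None
try:
    emulate_py2_unicode_decode(u'中国', 'utf-8')
except UnicodeEncodeError as exc:
    failure = exc
    print(type(exc).__name__)   # the *encode* step fails, not the decode

# With UTF-8 standing in for the default encoding, step 1 succeeds and the
# whole call is a pointless round trip.
print(emulate_py2_unicode_decode(u'中国', 'utf-8', default_encoding='utf-8') == u'中国')
```

This mirrors the output in the question: the error disappears after sys.setdefaultencoding('utf-8') only because step 1 stops failing, not because decoding a unicode string became meaningful.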

Note that while you may be tempted to change that default encoding, don't forget that other third-party libraries may contain similar issues, and might break if the default encoding is different from ASCII.
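One way to avoid touching the default encoding is to name the encoding explicitly wherever text crosses a byte boundary. The sketch below assumes a file is that boundary and uses io.open with a temporary scratch file; the file and the names handle and read_back are illustrative only. It should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

c_u = u'中国'

# Hypothetical scratch file, used only for this illustration.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # io.open encodes and decodes at the file boundary using the encoding
    # named here, so no implicit conversion via the default encoding is needed.
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(c_u)
    with io.open(path, 'r', encoding='utf-8') as handle:
        read_back = handle.read()
finally:
    os.remove(path)

print(read_back == c_u)
```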

Having said that, I strongly believe that Python would be better off if unicode.decode() had never been defined in the first place. Unicode strings are already decoded; there's no point in decoding them once more, especially in the way Python does.

edited Jan 22, 2015 at 18:58

community wiki

4 revs
vog


8 Comments

Tanky Woo Over a year ago

I know I should use encode. My question is why calling decode on u'abc' causes no problem, and whether my understanding is right.

vog Over a year ago

See the second part of my answer, where I describe how unicode.decode() behaves internally. This should make clear why u'abc'.decode() accidentally works.

Tanky Woo Over a year ago

I think part 2 of your answer contains an error: "any implicit conversion from unicode string to byte strings, always uses the ASCII character set". See the example code in my question: if I change the default encoding to utf-8, it works.

vog Over a year ago

@TankyWoo: You are right. I changed "always" to "usually". I guess only a Python hacker (i.e. a developer of Python itself) can describe this mechanism in full detail.

jfs Over a year ago

"usually uses the ASCII character set." -- it uses sys.getdefaultencoding(). You should mention that changing the default encoding is not recommended. It may break 3rd party libraries that do not expect it.
