A simple example:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import traceback

e_u = u'abc'
c_u = u'中国'

# First pass: default encoding as shipped with Python 2.
print sys.getdefaultencoding()
try:
    print e_u.decode('utf-8')
    print c_u.decode('utf-8')
except Exception as e:
    print traceback.format_exc()

# Second pass: reload(sys) brings back setdefaultencoding, which site.py
# deletes at startup, so the default can be switched to utf-8.
reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
try:
    print e_u.decode('utf-8')
    print c_u.decode('utf-8')
except Exception as e:
    print traceback.format_exc()
Output:
ascii
abc
Traceback (most recent call last):
  File "test_codec.py", line 15, in <module>
    print c_u.decode('utf-8')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
utf-8
abc
中国
A few problems have troubled me for several days while trying to thoroughly understand codecs in Python, and I want to check whether my understanding is right:

Under the ascii default encoding, u'abc'.decode('utf-8') raises no error, but u'中国'.decode('utf-8') does. I think that when I call u'中国'.decode('utf-8'), Python checks and finds that u'中国' is already unicode, so it first tries u'中国'.encode(sys.getdefaultencoding()). That implicit encode is what fails, which is why the exception is a UnicodeEncodeError rather than a decode error. u'abc', on the other hand, has the same code points as 'abc' (all < 128), so there is no error.

In Python 2.x, how does Python store a string's value internally? If all characters in a string are < 128, is it treated as ascii, and if any are > 128, as utf-8?

In [4]: chardet.detect('abc')
Out[4]: {'confidence': 1.0, 'encoding': 'ascii'}

In [5]: chardet.detect('abc中国')
Out[5]: {'confidence': 0.7525, 'encoding': 'utf-8'}

In [6]: chardet.detect('中国')
Out[6]: {'confidence': 0.7525, 'encoding': 'utf-8'}
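
To make my first guess concrete, here is a minimal sketch that spells out the two steps I believe Python 2 performs when decode() is called on a unicode object. The helper names decode_like_py2 and try_decode are hypothetical (they are not part of any library), and passing the default encoding as a parameter is my own simplification so that the snippet does not depend on sys.setdefaultencoding. It is written to run unchanged on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: mimics the two steps I suspect Python 2 takes for
# unicode.decode(). decode_like_py2 is a hypothetical helper, not a real API.


def decode_like_py2(text, encoding, default_encoding='ascii'):
    # Step 1 (suspected implicit step): unicode -> bytes via the default encoding.
    raw = text.encode(default_encoding)
    # Step 2 (the step that was actually requested): bytes -> unicode.
    return raw.decode(encoding)


def try_decode(text, default_encoding):
    # Return the decoded text, or the exception class name on failure.
    try:
        return decode_like_py2(text, 'utf-8', default_encoding)
    except UnicodeError as exc:
        return type(exc).__name__


ascii_abc = try_decode(u'abc', 'ascii')   # all code points < 128, so step 1 works
ascii_cn = try_decode(u'中国', 'ascii')    # step 1 fails before any decoding happens
utf8_cn = try_decode(u'中国', 'utf-8')     # step 1 succeeds and the text round-trips

# The bytes that step 1 produces under UTF-8: three bytes per character here.
utf8_bytes = u'中国'.encode('utf-8')

print(ascii_cn)
```

If this model is right, it would explain both results in the output: the ascii case fails with an encode error in step 1, and the utf-8 case round-trips. It also hints at my second question: as far as I understand, a Python 2 str simply holds bytes with no encoding attached, so '中国' typed in a UTF-8 source file or terminal would be the six bytes in utf8_bytes, and chardet is only guessing the encoding from those bytes.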