I'm doing some text processing in Python 2.7, where the default encoding is ASCII. I get a UnicodeDecodeError when I try to encode some of my strings into utf-8. Specifically, for each word in my document, I do this:

word = word.encode('utf-8')

This works well when my characters are all ASCII but when they're not, I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 5: ordinal not in range(128)

I'm confused, since I thought calling encode would turn everything from ASCII into utf-8. Since utf-8 is a superset of ASCII, I shouldn't get any issues...but I do.

Also, I'm not sure why it says that ASCII can't decode when I would expect it to say that ASCII can't encode my word into utf-8.

Any help would be awesome!

Comment: This is like a top-3 problem in Python 2.x; it has to be a duplicate... Commented Jul 3, 2018 at 4:38

1 Answer

You encode Unicode strings to byte strings, and decode byte strings to Unicode strings. So to encode to a UTF-8 byte string, start with a Unicode string. If you start with a byte string, Python 2.7 first implicitly decodes it to Unicode using the default ASCII codec. If your byte string contains non-ASCII bytes, that hidden decode step fails, which is why you get a UnicodeDecodeError rather than an encode error.
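
A minimal sketch of that implicit step, written for Python 3 so the hidden decode has to be spelled out. The helper name py2_style_encode is hypothetical; it only imitates what Python 2.7 does when .encode() is called on a byte string, and the sample bytes are an assumed example.

```python
raw = b'caf\xc3\xa9'  # assumed sample: the UTF-8 bytes for "café"


def py2_style_encode(byte_string, target='utf-8'):
    """Imitate Python 2.7's str.encode(): ASCII decode first, then encode."""
    return byte_string.decode('ascii').encode(target)


error_message = None
try:
    py2_style_encode(raw)
except UnicodeDecodeError as exc:
    # The failure comes from the decode step, not from the encode step.
    error_message = str(exc)

# Decoding with the encoding the bytes are really in avoids the error.
text = raw.decode('utf-8')
encoded = text.encode('utf-8')
```

Here error_message reports that the 'ascii' codec cannot decode byte 0xc3, the same kind of message as in the question, while the explicit decode followed by encode round-trips cleanly.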

Python 3 removes the implicit decode to Unicode when you start with a byte string. In fact, .encode() is not available on byte strings and .decode() is not available on Unicode strings. Python 3 also changes the default encoding to UTF-8.

Examples:

Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'café'.encode('utf8')  # Started with a byte string
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 3: ordinal not in range(128)
>>> u'café'.encode('utf8')  # Started with Unicode string
'caf\xc3\xa9'

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'café'.encode()  # Starting with a Unicode string, default UTF-8.
b'caf\xc3\xa9'
>>> 'café'.decode()  # You can only *encode* Unicode strings.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
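
For the per-word loop in the question, one possible approach is to decode byte strings with the encoding the document was actually saved in, and only then encode to UTF-8. This is a sketch under assumptions: to_utf8 and its source_encoding parameter are hypothetical names, and latin-1 is only an assumed example of a source encoding. The real one has to be known from the data.

```python
def to_utf8(word, source_encoding='utf-8'):
    """Return word as a UTF-8 byte string.

    Byte strings are decoded with source_encoding first, so the
    implicit ASCII decode never happens. Unicode strings are encoded
    directly.
    """
    if isinstance(word, bytes):  # bytes is an alias for str on Python 2.7
        word = word.decode(source_encoding)
    return word.encode('utf-8')


# A byte string assumed to be latin-1, where 0xe9 is "é".
from_bytes = to_utf8(b'caf\xe9', source_encoding='latin-1')

# A Unicode string needs no decode step.
from_text = to_utf8(u'caf\xe9')
```

Both calls produce the same UTF-8 byte string, 'caf\xc3\xa9'. If the wrong source_encoding is passed, the decode either raises UnicodeDecodeError or silently yields the wrong characters, so the helper is only as good as that assumption.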

Further reading: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
