
Why does using replace here:

s = s.encode('ascii', 'replace')

give me this error?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 6755: ordinal not in range(128)

Isn't the whole point of 'replace' or 'ignore' to not fail when a byte can't be decoded? Am I misunderstanding this?

(Sorry, I can't provide the actual string; the corpus is very large.)

In any case, how do I tell Python to ignore or replace characters that aren't ASCII?

5 Comments

  • You don't need to provide the actual string, you just need to provide a string that reproduces the same problem. '\xcb', for instance, does just fine. Note, however, that with Python 2.x, that's a byte string, not a unicode string, which is very much relevant to your problem.
  • That said, replace and ignore are for UnicodeEncodeError handling, not UnicodeDecodeError handling. They're used when you're starting from a unicode string and creating an ASCII one.
  • So if I encode a string as ascii with replace option, I may not be able to decode that string safely as ascii? Wouldn't that by definition mean that the encode function didn't create an ascii string?
  • "So if I encode a string as ascii with replace option, I may not be able to decode that string safely as ascii?" -- huh? I said nothing of the sort.
  • ...to restate a bit more clearly: the problem is that if you call encode() on a bytestring, not a unicode string, Python will try to decode it to unicode first (to get a unicode string it can then encode back to ASCII as you're asking for), but using settings other than what you want. Thus, if you really want to transcode a bytestring through unicode, you should write the code to handle both directions yourself.

1 Answer


Note that you're getting a UnicodeDecodeError, not a UnicodeEncodeError.

That's because s.encode() takes a unicode string as input, but in this case you're not giving it one; you're giving it a bytestring instead.

Thus, Python 2 is implicitly decoding the bytestring you hand it to unicode (using the strict default ASCII codec, not your 'replace' handler) before re-encoding it to ASCII, and it's in that implicit decode that the error occurs.
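
You can reproduce this with any single non-ASCII byte; for instance, in a Python 2 interpreter (only the position in the message differs from yours):

>>> '\xcb'.encode('ascii', 'replace')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)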


This three-way round-trip is silly, but if you really wanted to do it:

s_bytes = '\xcb' # standard Python 2 string, aka a Python 3 bytestring
s_unicode = s_bytes.decode('ascii', 'replace') # a unicode string now
s_ascii = s_unicode.encode('ascii', 'replace') # a bytestring again
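
Running that shows what 'replace' does at each step: the undecodable byte becomes the U+FFFD replacement character on the way in, and '?' on the way out:

>>> s_unicode
u'\ufffd'
>>> s_ascii
'?'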

6 Comments

  • I am giving it a text file's contents as input (which is a string).
  • A Python 2 string, right? Also what Python 3 calls a bytestring? Then this answer is precisely on-point.
  • @anthonybell: check type(s). It will be str, that is, a byte string, not a unicode string. Bytestrings can be converted to unicode (using decode); unicode strings can be converted to bytestrings (using encode).
  • Ohh, I think I got it. decode/encode change the internal representation of the characters in memory, then? So if my string (a bytestring) should only have ASCII characters, I have to decode it from ASCII and encode it back to ASCII to get back a normal Python string with only ASCII-supported characters in it?
  • "Have to" is a little strong -- there are other approaches available (heck, you could just filter out any character not in the printable set; see the sketch below), but if you want to use the encode/decode facilities (which are meant for converting to and from unicode), yes, that's the way you do it.