
Why does using replace here:

s = s.encode('ascii', 'replace')

give me this error?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 6755: ordinal not in range(128)

Isn't the whole point of 'replace' or 'ignore' to not fail when a byte can't be decoded? Am I misunderstanding this?

(Sorry, I can't provide the actual string; the corpus is very large.)

In any case, how do I tell Python to ignore or replace characters that aren't ASCII?

5 Comments

  • You don't need to provide the actual string, you just need to provide a string that reproduces the same problem. '\xcb', for instance, does just fine. Note, however, that with Python 2.x, that's a byte string, not a unicode string, which is very much relevant to your problem.
  • That said, replace and ignore are for UnicodeEncodeError handling, not UnicodeDecodeError handling. They're used when you're starting from a unicode string and creating an ASCII one.
  • So if I encode a string as ascii with replace option, I may not be able to decode that string safely as ascii? Wouldn't that by definition mean that the encode function didn't create an ascii string?
  • "So if I encode a string as ascii with replace option, I may not be able to decode that string safely as ascii?" -- huh? I said nothing of the sort.
  • ...to restate a bit more clearly: the problem is that if you call encode() on a bytestring, not a unicode string, Python will try to decode it to unicode first (to get a unicode string it can then encode back to ASCII as you're asking for), but using settings other than what you want. Thus, if you really want to transcode a bytestring through unicode, you should write the code to handle both directions yourself.

1 Answer


Note that you're getting a UnicodeDecodeError, not a UnicodeEncodeError.

That's because s.encode() takes a unicode string as input, but in this case you're not giving it one; you're giving it a bytestring instead.

Thus, Python 2 is implicitly decoding the bytestring you hand it to unicode (using the strict default ASCII codec, not your 'replace' handler) before re-encoding it to ASCII, and it's in that implicit decode that the error occurs.
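
You can reproduce this with any single non-ASCII byte; for instance, in a Python 2 interpreter (only the position in the message differs from yours):

>>> '\xcb'.encode('ascii', 'replace')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)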


This three-way round-trip is silly, but if you really wanted to do it:

s_bytes = '\xcb' # standard Python 2 string, aka a Python 3 bytestring
s_unicode = s_bytes.decode('ascii', 'replace') # a unicode string now
s_ascii = s_unicode.encode('ascii', 'replace') # a bytestring again
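
Running that shows what 'replace' does at each step: the undecodable byte becomes the U+FFFD replacement character on the way in, and '?' on the way out:

>>> s_unicode
u'\ufffd'
>>> s_ascii
'?'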

6 Comments

  • I am giving it a text file's contents as input (which is a string).
  • A Python 2 string, right? Also what Python 3 calls a bytestring? Then this answer is precisely on-point.
  • @anthonybell: check type(s). It will be str, that is, a byte string, not a unicode string. Bytestrings can be converted to unicode (using decode); unicode strings can be converted to bytestrings (using encode).
  • Ohh, I think I got it. decode/encode change the internal representation of the characters in memory, then? So if my string (a bytestring) should only have ASCII characters, I have to decode it from ASCII and encode it back to ASCII to get back a normal Python string with only ASCII-supported characters in it?
  • "Have to" is a little strong -- there are other approaches available (heck, you could just filter out any character not in the printable set; see the sketch below), but if you want to use the encode/decode facilities (which are meant for converting to and from unicode), yes, that's the way you do it.