Why does using replace here:
s = s.encode('ascii', 'replace')
give me this error?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 6755: ordinal not in range(128)
Isn't the whole point of 'replace' or 'ignore' to not fail when it can't decode a byte? Am I misunderstanding this?
(sorry I can't provide the actual string, the corpus is very large)
In any case, how do I tell Python to ignore or replace characters that aren't ASCII?
Encoding a unicode string such as u'\xcb' with 'replace', for instance, works just fine. Note, however, that with Python 2.x a plain '\xcb' is a byte string, not a unicode string, which is very much relevant to your problem.
replace and ignore are for UnicodeEncodeError handling, not UnicodeDecodeError handling. They're used when you're starting from a unicode string and creating an ASCII one.
If you call encode() on a bytestring rather than a unicode string, Python 2 will first try to decode it to unicode (to get a unicode string it can then encode back to ASCII, as you're asking for), but that implicit decode uses the default settings (the ascii codec with strict error handling), not the error handler you passed to encode(). That implicit step is what raises the UnicodeDecodeError. Thus, if you really want to transcode a bytestring through unicode, you should write the code to handle both directions yourself.
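As a minimal sketch of handling both directions explicitly: the helper name to_ascii and the sample bytes below are made up for illustration, and the sketch assumes the input is a byte string whose non-ASCII bytes you are happy to lose. The b'' literals keep it valid on both Python 2 and Python 3.

```python
def to_ascii(raw, errors='replace'):
    """Hypothetical helper: force a byte string down to pure ASCII bytes.

    `errors` is passed to both steps and may be 'replace' or 'ignore'.
    """
    # Step 1: bytes -> unicode. Doing this explicitly lets us choose the
    # error handler; with 'replace', undecodable bytes become U+FFFD.
    text = raw.decode('ascii', errors)
    # Step 2: unicode -> ASCII bytes. With 'replace', any character that
    # cannot be encoded (including U+FFFD) becomes '?'.
    return text.encode('ascii', errors)


sample = b'caf\xcb ok'             # made-up input with one non-ASCII byte
print(to_ascii(sample))            # the 0xcb byte is replaced with '?'
print(to_ascii(sample, 'ignore'))  # the 0xcb byte is dropped
```

If the corpus is actually in a known encoding (UTF-8 or Latin-1, say), decoding with that codec in step 1 instead of 'ascii' would preserve more information before the final encode.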