I'm trying to write out some text and encode it as utf-8 where possible, using the following code:
outf.write((lang_name + "," + (script_name or "") + "\n").encode("utf-8", errors='replace'))
I'm getting the following error:
File "C:\Python27\lib\encodings\cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>
I thought the errors='replace' part of my encode call would handle that?
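A quick experiment (with made-up strings, not my real data) suggests that the errors argument only applies to the unicode-to-bytes step; if the value is already a bytestring, Python 2 implicitly decodes it with the default codec first, and that step is strict:

u'caf\xe9'.encode('utf-8', 'replace')       # fine: unicode -> UTF-8 bytes
'caf\xc3\xa9'.encode('utf-8', 'replace')    # str: implicitly decoded with the default codec first, which raises UnicodeDecodeError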
fwiw, I'm just opening the file with
outf = open(outfile, 'w')
without explicitly declaring the encoding.
print repr(outf)
produces:
<open file 'myfile.csv', mode 'w' at 0x000000000315E930>
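For comparison, here's a sketch of opening the file with an explicit encoding via io.open (available in Python 2.7), so that the file object does the encoding and I only ever hand it unicode:

import io

outf = io.open(outfile, mode='w', encoding='utf-8')
# text-mode io files want unicode, so any UTF-8 bytestrings would need
# a .decode('utf-8') before being written
outf.write(u'some unicode text\n')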
I separated out the write statement into a separate concatenation, encoding, and file write:
outstr = lang_name + "," + (script_name or "") + "\n"
encoded_outstr = outstr.encode("utf-8", errors='replace')
outf.write(encoded_outstr)
It is the concatenation that throws the exception.
The strings are, via print repr(foo):
lang_name: 'G\xc4\x81ndh\xc4\x81r\xc4\xab'
script_name: u'Kharo\u1e63\u1e6dh\u012b'
Further detective work reveals that I can concatenate either one of those with a plain ASCII string without any difficulty; it's putting them both into the same string that breaks things.
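Putting those reprs together, a minimal sketch of what I think is going on, plus a decode-first workaround, using the values above:

lang_name = 'G\xc4\x81ndh\xc4\x81r\xc4\xab'    # str holding UTF-8 bytes
script_name = u'Kharo\u1e63\u1e6dh\u012b'      # unicode

# str + unicode makes Python 2 implicitly decode the str with the
# default codec, so this line raises UnicodeDecodeError:
# broken = lang_name + "," + script_name + "\n"

# decoding the bytestring first keeps everything unicode, and the
# final .encode('utf-8') then always succeeds:
outstr = lang_name.decode('utf-8') + u"," + (script_name or u"") + u"\n"
encoded_outstr = outstr.encode('utf-8')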
From the comments:

What are script_code and script_name here? You have a decoding error, not an encoding error, so one or both are bytestrings, not unicode objects. .encode("utf-8") on a Unicode string will always work, since all Unicode code points can be represented in UTF-8, so in that case errors='replace' is superfluous.

What is outf here? How did you open that object? That your code tries to decode a bytestring as CP1252 is suspicious. For implicit decodings that would mean you used sys.setdefaultencoding() (a big no-no), but if outf is not a regular Python 2 file object and is instead a codecs or io file object, that would explain the exception as well.

I suspect that if you add s = script_code + "," + (script_name or "") + "\n" on the line before, that line will raise the exception.
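A quick way to test the sys.setdefaultencoding() theory from the comments (just a sketch):

import sys
# a stock Python 2 install reports 'ascii' here; seeing 'cp1252' would
# explain why the implicit decode went through encodings/cp1252.py
print sys.getdefaultencoding()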