python utf-8 encoding throws UnicodeDecodeError despite "errors = 'replace' "

Question

I'm trying to write out some text and encode it as utf-8 where possible, using the following code:

outf.write((lang_name + "," + (script_name or "") + "\n").encode("utf-8", errors='replace'))

I'm getting the following error:

File "C:\Python27\lib\encodings\cp1252.py", line 15, in decode 
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>

I thought the errors='replace' part of my encode call would handle that?

fwiw, I'm just opening the file with

outf = open(outfile, 'w')

without explicitly declaring the encoding.

print repr(outf)

produces:

<open file 'myfile.csv', mode 'w' at 0x000000000315E930>

I separated out the write statement into a separate concatenation, encoding, and file write:

outstr = lang_name + "," + (script_name or "") + "\n"
encoded_outstr = outstr.encode("utf-8", errors='replace')
outf.write(encoded_outstr)

It is the concatenation that throws the exception.

The strings are, via print repr(foo):

lang_name: 'G\xc4\x81ndh\xc4\x81r\xc4\xab'
script_name: u'Kharo\u1e63\u1e6dh\u012b'

Further detective work reveals that I can concatenate either one of those with a plain ascii string without any difficulty - it's putting them both into the same string that is breaking things.

What are script_code and script_name here? You have a decoding error, not an encoding error, so one or both are bytestrings, not unicode objects.
– Martijn Pieters, Commented Jul 8, 2015 at 17:43
.encode("utf-8") on a Unicode string will always work, since all Unicode code points can be represented in UTF-8, so in that case errors='replace' is superfluous.
– RemcoGerlich, Commented Jul 8, 2015 at 17:44
Next, what is outf here? How did you open that object? That your code tries to decode a bytestring as CP1252 is suspicious. For implicit decodings that'd mean you used sys.setdefaultencoding() (a big no-no), but if outf is not a regular Python 2 file object but instead a codecs or io file object, that'd explain the exception as well.
– Martijn Pieters, Commented Jul 8, 2015 at 17:44
@MartijnPieters I showed how I opened outf. script_code and script_name are strings scraped from a webpage.
– PurpleVermont, Commented Jul 8, 2015 at 17:45
I think that if the statement s = script_code + "," + (script_name or "") + "\n" is put on the line before, that line will raise the exception.
– RemcoGerlich, Commented Jul 8, 2015 at 17:45

RemcoGerlich · Accepted Answer · 2015-07-08 19:32:12Z

2

So, the problem is that you are concatenating the bytestring 'G\xc4\x81ndh\xc4\x81r\xc4\xab' and the Unicode string u'Kharo\u1e63\u1e6dh\u012b'.

To do that, Python 2.7 first tries to decode the bytestring into Unicode using its default encoding. Your default encoding appears to be cp1252 rather than ASCII, for reasons I can't determine from here, but either way the decode fails, just as it would with ASCII, because that string is actually UTF-8.
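A minimal sketch of that failure, assuming the implicit conversion behaves like an explicit .decode() with the default encoding. The helper name try_decode is made up for illustration; the byte value is the one from the question. It runs on Python 2.7 and Python 3 because it only uses explicit decodes.

```python
# Value copied from the question's repr() output: UTF-8 encoded bytes.
lang_name = b'G\xc4\x81ndh\xc4\x81r\xc4\xab'


def try_decode(raw, encoding):
    """Hypothetical helper: return the decoded text, or the error raised."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


# ASCII cannot decode any byte >= 0x80, so this fails on the first \xc4.
ascii_result = try_decode(lang_name, 'ascii')

# cp1252 maps \xc4 to a character, but has no mapping for \x81.
cp1252_result = try_decode(lang_name, 'cp1252')

# The bytes really are UTF-8, so this decode succeeds.
utf8_result = try_decode(lang_name, 'utf-8')
```

Only the UTF-8 decode returns text; the other two return a UnicodeDecodeError, which is the same class of error the concatenation raised.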

Your best solution is probably to make sure that this doesn't happen, by changing the way the variables get those values in the first place.

If you can't, then since you are encoding to UTF-8 on the next line anyway, it's probably easiest to encode only script_name and leave lang_name as the bytestring it already is:

encoded_outstr = lang_name + b"," + (script_name or u"").encode('utf-8') + b"\n"

Note that I used b"," to explicitly make those string literals bytestrings and not Unicode strings; if you are using from __future__ import unicode_literals for Python 3 compatibility, then they are Unicode by default and the problem would just occur again.
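If some values arrive as bytestrings and others as Unicode, one way to keep the concatenation all-bytes is a small normalizing helper. This is only a sketch: to_utf8_bytes is a hypothetical name, and it assumes any bytestring it receives is already UTF-8, which may not hold for scraped data.

```python
def to_utf8_bytes(value):
    """Hypothetical helper: return value as UTF-8 bytes; None becomes b""."""
    if value is None:
        return b""
    if isinstance(value, bytes):
        # Assumption: bytestrings are already UTF-8 and pass through as-is.
        return value
    return value.encode("utf-8")


# Values copied from the question's repr() output.
lang_name = b'G\xc4\x81ndh\xc4\x81r\xc4\xab'
script_name = u'Kharo\u1e63\u1e6dh\u012b'

# Every operand is a bytestring, so no implicit decode is attempted.
encoded_outstr = to_utf8_bytes(lang_name) + b"," + to_utf8_bytes(script_name) + b"\n"
```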


1 Comment

PurpleVermont Over a year ago

the problem is that I don't think the encodings are consistent for every iteration of the loop (!) Is there a programmatic way to test for what the encoding is? I kind of think that that's an open research question ;-)
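There is no fully reliable way to detect an encoding from the bytes alone, but a common heuristic is to try a strict UTF-8 decode first and fall back to a legacy code page. The sketch below assumes cp1252 is the likely fallback (a guess based on the traceback, not something the thread confirms); decode_best_effort is a hypothetical name.

```python
def decode_best_effort(raw, candidates=("utf-8", "cp1252")):
    """Hypothetical helper: decode bytes by trying each candidate in order."""
    if not isinstance(raw, bytes):
        return raw  # already Unicode text, nothing to do
    for encoding in candidates:
        try:
            # Strict decoding: any invalid byte sequence raises.
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what decodes as UTF-8, mark the rest with U+FFFD.
    return raw.decode("utf-8", "replace")


utf8_guess = decode_best_effort(b'G\xc4\x81ndh\xc4\x81r\xc4\xab')
legacy_guess = decode_best_effort(b'caf\xe9')
```

Strict UTF-8 rarely accepts text in another encoding by accident, which is why it goes first. The fallback is still a guess and can produce wrong characters without raising.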

Mark Ransom · Accepted Answer · 2015-07-08 19:32:58Z

2

When you concatenate a byte string and a Unicode string, Python 2 attempts to convert the byte string to Unicode first. If the byte string contains any non-ASCII bytes (\x80 to \xff), the automatic conversion will fail with the kind of error you show. Notice that it says can't decode, not can't encode: this shows that the error did not occur in your call to encode.

The solution is to decode the byte string into Unicode yourself, using the encoding the bytes are actually in (UTF-8 for the value shown), so that all the inputs to the concatenation are Unicode strings.

outstr = lang_name.decode("utf-8") + u"," + (script_name or u"") + u"\n"
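As a sketch of how the decoded string could then reach the file, the example below opens the output with io.open and an explicit encoding, so the file object does the UTF-8 encoding on write. The temporary directory is an assumption made only to keep the example self-contained; the values come from the question.

```python
import io
import os
import tempfile

# Values copied from the question's repr() output.
lang_name = b'G\xc4\x81ndh\xc4\x81r\xc4\xab'
script_name = u'Kharo\u1e63\u1e6dh\u012b'

# Decode first, so every operand in the concatenation is Unicode.
outstr = lang_name.decode("utf-8") + u"," + (script_name or u"") + u"\n"

# Assumption: a throwaway directory stands in for the real output location.
path = os.path.join(tempfile.mkdtemp(), "myfile.csv")

# io.open accepts an encoding on Python 2.6+ and Python 3;
# newline="" writes "\n" unchanged on every platform.
with io.open(path, "w", encoding="utf-8", newline="") as outf:
    outf.write(outstr)

# Read the raw bytes back to see what was written.
with io.open(path, "rb") as check:
    written = check.read()
```

With this arrangement the explicit .encode("utf-8", errors='replace') call from the question is no longer needed, since the file object does the encoding.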
