Python2: Using .decode with errors='replace' still returns errors

Question

So I have a message which is read from a file of unknown encoding, and I want to send it to a webpage for display. I've grappled a lot with UnicodeErrors, have gone through many Q&As on StackOverflow, and think I have a decent understanding of how Unicode and encodings work. My current code looks like this:

try:
    return message.decode(encoding='utf-8')
except:
    try:
        return message.decode(encoding='latin-1')
    except:
        print("Unable to entirely decode in latin or utf-8, will replace error characters with '?'")
        return message.decode(encoding='utf-8', errors="replace")

The returned message is then dumped into JSON and sent to the front end.

I assumed that because I'm using errors="replace" in the last try/except, I would avoid exceptions at the expense of a few '?' characters in my display. An acceptable cost.

However, it seems that I was too hopeful: for some files I still get a UnicodeDecodeError saying "ascii codecs cannot decode" for some character. Why doesn't errors="replace" just take care of this?

(Also, as a bonus question: what does ascii have to do with any of this? I'm specifying UTF-8.)

Can you paste an example of message? SO widgets are faithful to unicode & other strange stuff, so it will make a real minimal reproducible example.
– Jean-François Fabre ♦, Commented Oct 13, 2016 at 19:10
Which line of code do you get the UnicodeDecodeError on?
– cdarke, Commented Oct 13, 2016 at 19:22
It's the last line that throws an error. I'll update with the exception trace in a bit.
– Jad S, Commented Oct 13, 2016 at 19:30
See my comment on @bobince's answer if curious about what the issue was.
– Jad S, Commented Nov 3, 2016 at 19:56

bobince · Accepted Answer · 2016-10-14 08:46:26Z

11

You should not get a UnicodeDecodeError with errors='replace'. Also str.decode('latin-1') should never fail, because ISO-8859-1 has a valid character mapping for every possible byte sequence.
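To illustrate the Latin-1 claim, here is a small sketch (the variable names are mine, not from the question). It builds a byte string containing all 256 byte values and decodes it. Latin-1 maps each byte directly to the code point with the same number, so the decode cannot raise:

```python
# Every one of the 256 possible byte values, in order.
# bytes(bytearray(...)) is used so the same line works on Python 2.7 and 3.
all_bytes = bytes(bytearray(range(256)))

# Latin-1 maps byte N to code point U+00NN, so no input can be invalid.
text = all_bytes.decode('latin-1')

# Encoding back restores the original bytes exactly.
round_trip = text.encode('latin-1')

print(len(text), round_trip == all_bytes)
```

One consequence for the code in the question: for byte input, the Latin-1 branch always succeeds, so the errors="replace" branch is never reached. Succeeding is not the same as being right, though. Bytes that were really in some other encoding will decode to the wrong characters without any error.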

My suspicion is that message is already a unicode string, not bytes. Unicode text has already been ‘decoded’ from bytes and can't be decoded any more.

When you call .decode() on a unicode string, Python 2 tries to be helpful and decides to encode the Unicode string back to bytes (using the default encoding), so that you have something that you can really decode. This implicit encoding step doesn't use errors='replace', so if there are any characters in the Unicode string that aren't in the default encoding (probably ASCII) you'll get a UnicodeEncodeError.
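A rough emulation of that two-step behaviour may make it easier to see where ASCII comes in. py2_style_decode is a made-up name, not a real Python function, and the hard-coded 'ascii' assumes the default encoding has not been changed. The sketch is written so that it also runs on Python 3, where text strings have no .decode() at all:

```python
def py2_style_decode(text, encoding, errors='strict'):
    """Hypothetical emulation of calling .decode() on a Python 2 unicode string."""
    # Step 1 (implicit in Python 2): encode with the default codec, strictly.
    # The caller's `errors` argument is NOT applied to this step.
    raw = text.encode('ascii')
    # Step 2: the decode the caller actually asked for.
    return raw.decode(encoding, errors)


message = u'caf\xe9'  # already text, and it contains a non-ASCII character

try:
    py2_style_decode(message, 'utf-8', errors='replace')
    outcome = 'decoded'
except UnicodeEncodeError as exc:
    # The failure comes from step 1, so it names the ascii codec.
    outcome = 'UnicodeEncodeError from the %s codec' % exc.encoding

print(outcome)
```

Pure-ASCII text passes through step 1 unharmed, which is why the problem only shows up for some files.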

(Python 3 no longer does this as it is terribly confusing.)

Check the type of message and, assuming it is indeed Unicode, work back from there to find where it was decoded (possibly implicitly) to replace that with the correct decoding.
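Until the source of the mixed input is found, a defensive helper along these lines could serve as a stopgap. This is only a sketch: to_text and text_type are names I made up, and the UTF-8 then Latin-1 order is carried over from the question rather than being a recommendation.

```python
text_type = type(u'')  # unicode on Python 2, str on Python 3


def to_text(message):
    """Return `message` as Unicode text (hypothetical helper)."""
    if isinstance(message, text_type):
        # Already decoded. Calling .decode() again is what triggers the error.
        return message
    try:
        return message.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 accepts any byte sequence, so this fallback cannot fail.
        return message.decode('latin-1')


print(repr(to_text(u'caf\xe9')))       # text passes through untouched
print(repr(to_text(b'caf\xc3\xa9')))   # valid UTF-8 bytes
print(repr(to_text(b'caf\xe9')))       # not valid UTF-8, falls back to Latin-1
```

Checking the type first avoids relying on which exception happens to be raised, which is what the decode-then-encode workaround ends up doing.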


2 Comments

Jad S Over a year ago

So you're correct, that seems to be what's happening. I had tried to use either encode('latin-1') or decode('latin-1') and each was giving me errors at some point. I just realized the issue was that there's an inconsistency in the string being passed in. Sometimes it's unicode, other times it's encoded. So if I used either of those two methods, I'd get UnicodeEncodeError or UnicodeDecodeError depending on which failed. I fixed it by trying decode, then trying encode when that fails, and it seems to work. At least until I find the source of the inconsistent input.

Gabriel Staples Over a year ago

Official documentation always helps to complete the picture and help me make sense of things. Here it is for bytes.decode(): docs.python.org/3/library/stdtypes.html#bytes.decode. And for the errors=replace parameter: docs.python.org/3/library/codecs.html#error-handlers.

Jan · Answer · 2016-10-13 19:42:47Z

1

Decoding with errors='replace' implements the 'replace' error handling (for text encodings only): it substitutes '?' for encoding errors (to be encoded by the codec), and '\ufffd' (the Unicode replacement character) for decoding errors.

"Text encoding" here means a "codec which encodes Unicode strings to bytes."

Maybe your data is malformed. You could try the 'ignore' error handling, where malformed data is ignored and encoding or decoding continues without further notice:

message.decode(encoding='utf-8', errors="ignore")
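For comparison, here is what 'ignore' does to a sample byte string next to 'replace' (the input is invented for illustration). The offending byte disappears without leaving any marker, so the front end has no way to tell that something was lost:

```python
raw = b'caf\xe9 au lait'  # Latin-1 bytes, which are not valid UTF-8

ignored = raw.decode('utf-8', 'ignore')    # the bad byte is dropped silently
replaced = raw.decode('utf-8', 'replace')  # the bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
```

Like 'replace', this only applies when message really is bytes. It does not affect the implicit encoding step that happens when message is already unicode.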

edited Oct 13, 2016 at 19:42

answered Oct 13, 2016 at 19:36

