I have a message that is read from a file of unknown encoding, and I want to send it to a web page for display. I've grappled a lot with UnicodeErrors, have gone through many Q&As on StackOverflow, and think I have a decent understanding of how Unicode and encodings work. My current code looks like this:
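# This runs inside the function that returns the decoded message.
try:
    # First attempt: strict UTF-8.
    return message.decode(encoding='utf-8')
except:
    try:
        # Second attempt: Latin-1.
        return message.decode(encoding='latin-1')
    except:
        # Last resort: UTF-8 again, replacing anything undecodable.
        print("Unable to entirely decode in latin or utf-8, will replace error characters with '?'")
        return message.decode(encoding='utf-8', errors="replace")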
The returned message is then serialized to JSON and sent to the front end.
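Roughly, the hand-off to the front end looks like the sketch below. The function name `to_json_payload` and the `"message"` key are placeholders made up for illustration, not my real code, and the sketch assumes the text has already been decoded successfully.

```python
import json


def to_json_payload(text):
    # Hypothetical helper: wrap the already-decoded text in a JSON object.
    # json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
    return json.dumps({"message": text})


# Made-up sample text containing one non-ASCII character.
payload = to_json_payload(u"caf\u00e9")
```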
I assumed that because I'm using errors="replace" in the last try/except, I would avoid exceptions at the expense of having a few '?' characters in my display, which is an acceptable cost.
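To make that expectation concrete, here is a minimal sketch of the behaviour I was counting on. It assumes the message is a byte string (written here for Python 3), and the sample bytes are made up. As far as I can tell, the substitute is actually U+FFFD, the replacement character, rather than a literal '?'.

```python
# Made-up sample: "café" encoded as Latin-1, so the final byte is not valid UTF-8.
raw = b"caf\xe9"

# Strict decoding (the default) rejects the invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for the undecodable byte instead of raising.
replaced = raw.decode("utf-8", errors="replace")
```

With byte input like this, the strict call fails and the "replace" call succeeds, which is what I expected to happen for every file.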
However, it seems I was too hopeful: for some files I still get a UnicodeDecodeError saying that the 'ascii' codec can't decode some character. Why doesn't errors="replace" just take care of this?
(As a bonus question: what does ASCII have to do with any of this? I'm specifying UTF-8.)