411

I have a socket server that is supposed to receive valid UTF-8 characters from clients.

The problem is that some clients (mainly hackers) are sending all the wrong kinds of data over it.

I can easily distinguish genuine clients, but I am logging all the data sent to files so I can analyze it later.

Sometimes I get characters like œ that cause a UnicodeDecodeError.

I need to be able to make the string valid UTF-8, with or without those characters.


Update:

For my particular case the socket service was an MTA, and thus I only expect to receive ASCII commands such as:

EHLO example.com
MAIL FROM: <[email protected]>
...

I was logging all of this in JSON.

Then some folks out there without good intentions decided to send all kinds of junk.

That is why, for my specific case, it is perfectly OK to strip the non-ASCII characters.
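Something like this is what I mean; a minimal sketch (raw is the bytes payload from the socket; the log_line name and the JSON field name are placeholders, not my actual code):

import json

def log_line(raw, log_file):
    # decode defensively, dropping anything that is not valid UTF-8,
    # then strip the remaining non-ASCII characters before logging
    text = raw.decode('utf-8', errors='ignore')
    ascii_text = text.encode('ascii', errors='ignore').decode('ascii')
    log_file.write(json.dumps({'data': ascii_text}) + '\n')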

1 Comment

Does the string come out of a file or a socket? Could you please post code examples of how the string is encoded and decoded before it is sent through the socket/file handler? Commented Sep 17, 2012 at 23:05

13 Answers

455

http://docs.python.org/howto/unicode.html#the-unicode-type

str = unicode(str, errors='replace')

or

str = unicode(str, errors='ignore')

Note: This will strip out (ignore) the characters in question, returning the string without them.

For me this is the ideal case, since I'm using it as protection against non-ASCII input, which is not allowed by my application.
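In Python 3 there is no separate unicode type; the equivalent, as a sketch assuming the incoming data is a bytes object named raw_bytes, is to decode the raw bytes directly:

# errors='replace' substitutes U+FFFD for undecodable bytes;
# errors='ignore' drops them instead
text = raw_bytes.decode('utf-8', errors='replace')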

Alternatively: Use the open method from the codecs module to read in the file:

import codecs
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata:
    data = fdata.read()

8 Comments

Yes, though this is usually bad practice/dangerous, because you'll just lose characters. Better to determine or detect the encoding of the input string and decode it to unicode first, then encode as UTF-8, for example: str.decode('cp1252').encode('utf-8')
In some cases yes, you are right, it might cause problems. In my case I don't care about them, as they seem to be extra characters originating from the bad formatting and programming of the clients connecting to my socket server.
if you ended up here because you are having problems reading a file, opening the file in binary mode might help: open(file_name, "rb") and then apply Ben's approach from the comments above
How can I import unicode?
unicode was a specific string type in Python 2. In Python 3, all regular strings are Unicode strings, so there is nothing to import - just use str. Perhaps see also nedbatchelder.com/text/unipain.html
136

Changing the engine from C to Python did the trick for me.

Engine is C:

pd.read_csv(gdp_path, sep='\t', engine='c')

'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

Engine is Python:

pd.read_csv(gdp_path, sep='\t', engine='python')

No errors for me.
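Note that the python engine is slower than the C engine on large files. Since byte 0x92 is a curly apostrophe (’) in Windows-1252, another option, as a guess at the file's actual encoding rather than a certainty, is to pass it explicitly and keep the fast engine:

import pandas as pd

# 0x92 is ’ in Windows-1252, so the file was likely produced on
# Windows; if the guess is wrong this will decode to mojibake
df = pd.read_csv(gdp_path, sep='\t', encoding='cp1252')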

2 Comments

This might not be a good idea if you have a huge CSV file. It could lead to an OutOfMemory error or an automatic restart of your notebook's kernel. You should set the encoding in that case.
Excellent answer. Thank you, this worked for me. My file had a "?" inside a diamond-shaped character (the Unicode replacement character) that was causing the issue; to the naked eye it looked like a plain " (inch mark). I did two things to figure it out: a) df = pd.read_csv('test.csv', nrows=10000). This worked perfectly without the engine option, so I incremented nrows to find which row had the error. b) df = pd.read_csv('test.csv', engine='python'). This worked, and I printed the offending row using df.iloc[36145].
83

This type of issue crops up for me now that I've moved to Python 3. I had no idea Python 2 was simply steamrolling over any issues with file encoding.

I found this nice explanation of the differences and how to find a solution after none of the above worked for me.

http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

In short, to make Python 3 behave as similarly as possible to Python 2 use:

with open(filename, encoding="latin-1") as datafile:
    # work on datafile here

However, read the article; there is no one-size-fits-all solution.

3 Comments

the link is broken as of 2021-10-09
As of 2022-02-12 using Python 3.8 I have no problems.
Like all the other answers which blindly propose some random encoding, this will be the wrong answer for the majority of visitors. There's a reason the behavior of Python 2 was regarded as broken enough to be replaced. Python 3 transparently does the right thing most of the time, except on Windows, where the burden of the legacy code pages is still significant. The proper cure is to spend some time on understanding encodings. The Stack Overflow character-encoding tag info page has a brief overview and some forward pointers.
43

First, use get_encoding_type to detect the file's encoding:

from chardet import detect

# get file encoding type by letting chardet examine the raw bytes
def get_encoding_type(file):
    with open(file, 'rb') as f:
        rawdata = f.read()
    return detect(rawdata)['encoding']

Second, open the file with that encoding:

open(current_file, 'r', encoding=get_encoding_type(current_file), errors='ignore')
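Note that detect can return None as the encoding when the sample is too ambiguous, and open(..., encoding=None) silently falls back to the platform default. A defensive sketch (the utf-8 fallback is an assumption, not part of the original answer):

# fall back to UTF-8 when chardet cannot make a guess
encoding = get_encoding_type(current_file) or 'utf-8'
with open(current_file, 'r', encoding=encoding, errors='ignore') as f:
    text = f.read()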

2 Comments

What happens when it returns None?
Like the chardet documentation already tells you, it can't always guess, or it guesses wrong some of the time, because it's just examining statistical correlations. Naïve users will run it on files which don't contain text at all (images, PDF files, executable binaries, etc.; PDFs, Word documents, database dumps and so on often embed a representation of text, but the file format itself is binary), but sometimes genuine text documents also don't contain enough significant data points to establish an encoding. For illustration, you can guess what ?xac?rbat? represents, but probably not h??y?aie
38

>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ
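The same demonstration in Python 3, where byte strings need an explicit b prefix:

>>> b'\x9c'.decode('cp1252')
'œ'
>>> print(b'\x9c'.decode('cp1252'))
œ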

6 Comments

I'm confused, how did you choose cp1252? It worked for me, but why ? I don't know and now I'm lost :/. Could you elaborate ? Thanks a lot ! :)
Could you present an option that works for all characters? Is there a way to detect the characters that need to be decoded so a more generic code can be implemented? I see many people are looking at this and I bet for some discarding is not the desired option like it is for me.
As you can see this question has quite the popularity. Think you could expand your answer with a more generic solution?
There is no more generic solution to "Guess the encoding roulette"
found it using a combination of web search, luck and intuition: cp1252 was used by default in the legacy components of Microsoft Windows in English and some other Western languages
|
31

I had the same problem with UnicodeDecodeError and I solved it with this line. I don't know if it's the best way, but it worked for me.

str = str.decode('unicode_escape').encode('utf-8')
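That is Python 2 code, and be aware that unicode_escape reinterprets backslash sequences, so it can mangle input that was already valid UTF-8. A rough Python 3 equivalent, assuming raw is a bytes object, would be:

# interprets \xNN and \uNNNN escapes embedded in the raw bytes;
# only appropriate when the input really contains such escapes
text = raw.decode('unicode_escape')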

Comments

20

This solution worked nicely for me with Latin American accents, such as 'ñ'.

I solved this problem just by adding:

df = pd.read_csv(fileName,encoding='latin1')
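As the comments below note, latin1 never raises an error because every byte sequence is valid in it, so a wrong guess produces garbage instead of an exception. A more cautious sketch is to try UTF-8 first and only fall back:

import pandas as pd

# try the most likely encoding first; fall back to latin1,
# which always decodes but may produce mojibake
try:
    df = pd.read_csv(fileName, encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv(fileName, encoding='latin1')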

2 Comments

Worked for me too, but I wonder what's going to happen to the Chinese, Greek and Russian named media on my drive. To be continued...
Randomly guessing at a character set is not a good solution. Latin-1 will get rid of the warning, but produce garbage if the actual encoding in the file is something else. There are many legacy 8-bit encodings where ñ, á et al. have completely different character codes.
3

Just in case someone has the same problem: I'm using vim with YouCompleteMe, and ycmd failed to start with this error message. What I did was export LC_CTYPE="en_US.UTF-8", and the problem was gone.

6 Comments

How does this relate to this question?
It's exactly the same issue, if you know how YouCompleteMe works. The YCM plugin has a client/server architecture; communication between client and server uses a socket, both sides are Python modules, and they are not able to decode the packets if the encoding setting is incorrect.
I have the same problem. Can you please tell me where to put export LC_CTYPE="en_US.UTF-8"?
@Remonn Hi, you know bash has a profile file? Put it in there.
@hylepo, I'm on a windows system :)
3

What can you do if you need to make a change to a file, but don’t know the file’s encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler:

with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()
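The escaped bytes round-trip: writing the data back with the same error handler restores the original undecodable bytes unchanged. A sketch, reusing fname and data from above:

# surrogateescape turned each undecodable byte into a lone surrogate
# on read; on write it turns each one back into the original byte
with open(fname, 'w', encoding='ascii', errors='surrogateescape') as f:
    f.write(data)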

2 Comments

This caused my notebook to crash.
This is really weird advice. Probably read the file as raw bytes instead (mode "rb" instead of just "r").
2

I had the same error.

For me, Python complained about the byte "0x87". I looked it up on https://bytetool.web.app/en/ascii/code/0x87/, which told me that this byte belongs to the codec Windows-1252.

I then only added this line to the beginning of my Python file:

# -*- coding: Windows-1252 -*-

And all the errors were gone. Before adding this line, I had tried to import the file with pandas like this:

df = pd.read_csv(data, sep=",", engine='python', header=0, encoding='Windows-1252')

but this returned an error. So I changed it back to this:

df = pd.read_csv(data, sep=",", engine='python', header=0)

Comments

2

A similar error such as

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 22: invalid start byte

also shows up if one tries to open an Excel file using read_csv() in pandas. Using pd.read_excel() instead solves the error.

An example that demonstrates it (the file is named data_dictionary because data dictionaries are often Excel files, while the datasets themselves are CSV files):

import pandas as pd

# some sample data
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df.to_excel('data_dictionary.xlsx', index=False)


df = pd.read_csv("data_dictionary.xlsx")         # <----- error

df = pd.read_excel("data_dictionary.xlsx")       # <----- OK

Comments

1

If, as you say, you simply want to permit pure 7-bit ASCII, just discard any input which is not. There is no straightforward way to guess what the remote end intended those bytes to represent anyway, without an explicitly specified encoding.

# socket.read_line_bytes() stands in for however your server reads
# one raw line from the client
while data := socket.read_line_bytes():
    try:
        string = data.decode('us-ascii')
    except UnicodeDecodeError:
        # log a readable rendering of the offending bytes, then reject the line
        logger.warning('[%s] - rejected non-ASCII input %s'
                       % (client, data.decode('us-ascii', errors='backslashreplace')))
        socket.write(b'421 communication error - non-ASCII content rejected\r\n')
        continue
    ...

Comments

