python pandas, unicode decode error on read_csv

Question

When importing a CSV file with pandas, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 15: invalid start byte

Full traceback:

Traceback (most recent call last):

  File "<ipython-input-2-99e71d524b4b>", line 1, in <module>
    runfile('C:/AppData/FinRecon/py_code/python3/DataJoin.py', wdir='C:/AppData/FinRecon/py_code/python3')

  File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)

  File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/AppData/FinRecon/py_code/python3/DataJoin.py", line 500, in <module>
    M5()

  File "C:/AppData/FinRecon/py_code/python3/DataJoin.py", line 221, in M5
    s3 = pd.read_csv(working_dir+"S3.csv", sep=",") #encode here encoding='utf-16

  File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)

  File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 435, in _read
    data = parser.read(nrows)

  File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1139, in read
    ret = self._engine.read(nrows)

  File "C:\Users\stack\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1995, in read
    data = self._reader.read(nrows)

  File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read

  File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory

  File "pandas/_libs/parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows

  File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data

  File "pandas/_libs/parsers.pyx", line 1176, in pandas._libs.parsers.TextReader._convert_tokens

  File "pandas/_libs/parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype

  File "pandas/_libs/parsers.pyx", line 1315, in pandas._libs.parsers.TextReader._string_convert

  File "pandas/_libs/parsers.pyx", line 1553, in pandas._libs.parsers._string_box_utf8

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 15: invalid start byte

What I've tried:

`s3 = pd.read_csv(working_dir+"S3.csv", sep=",", encoding='utf-16')`

This gives a different error: `UnicodeError: UTF-16 stream does not start with BOM`

What can be done to get this file to be read properly?

The supposed duplicate has absolutely nothing to do with Unicode parsing errors. Voting to reopen.
– Ahmed Fasih, Commented Feb 20, 2020 at 1:35

Celius Stingher · Accepted Answer · 2019-08-22 14:00:24Z

9

Try using `s3 = pd.read_csv(working_dir+"S3.csv", sep=",", encoding='Latin-1')`
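If you would rather not hard-code the encoding, one rough option is to try UTF-8 first and fall back to a single-byte encoding only when decoding fails. The sketch below is an illustration, not part of pandas: `pick_encoding` and its candidate list are hypothetical names chosen here, and it reads the whole file into memory, so it assumes the file is reasonably small.

```python
from pathlib import Path


def pick_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper: the candidate list is an assumption about which
    encodings are likely for files exported on Windows.
    """
    raw = Path(path).read_bytes()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next
        return name
    raise ValueError(f"none of {candidates} could decode {path}")
```

The result can then be passed on, for example `pd.read_csv(path, sep=",", encoding=pick_encoding(path))`. Because `latin-1` accepts every byte value, it always succeeds as a last resort, which also means a successful decode does not prove the guess is right.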

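Encoding issues mostly arise from particular characters within the data. UTF-8 can represent every Unicode character, but it has a byte structure that must be respected at all times: bytes in the range 0x80 to 0xbf are only valid as continuation bytes after a lead byte. Your file contains a lone 0xa0 byte, which in Latin-1 (and Windows-1252) is a non-breaking space. UTF-8 encodes that same character as the two bytes 0xc2 0xa0, so a bare 0xa0 is an "invalid start byte". Hence your error. Latin-1 maps every one of the 256 possible byte values to a character, so decoding with it never fails, although the text will be wrong if the file was really written in some other encoding. For example, the bytes 0xef, 0xbb and 0xbf (the UTF-8 byte order mark) read as Latin-1 come out as latin small letter i with diaeresis, right-pointing double angle quotation mark and inverted question mark respectively.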
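The byte-level behaviour can be checked without pandas. This is a small sketch using only Python's built-in `bytes.decode` and `str.encode`; the sample value `raw` is made up for illustration and is not taken from the file in the question.

```python
# Hypothetical cell value containing a lone 0xa0 byte, as a Latin-1 or
# Windows-1252 export would write a non-breaking space.
raw = b"total\xa0amount"

utf8_error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xa0 is a continuation byte in UTF-8 and cannot start a character.
    utf8_error = str(exc)

# Latin-1 maps every byte to a code point, so this cannot fail.
latin1_text = raw.decode("latin-1")

# The same character (U+00A0) encoded as UTF-8 takes two bytes.
utf8_nbsp = "\u00a0".encode("utf-8")

# The UTF-8 byte order mark, misread as Latin-1, shows up as three characters.
bom_as_latin1 = b"\xef\xbb\xbf".decode("latin-1")

print(utf8_error)
print(repr(latin1_text), utf8_nbsp, repr(bom_as_latin1))
```

The first print shows the same kind of "invalid start byte" message as in the question, while the Latin-1 decode goes through and keeps the non-breaking space as U+00A0.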

edited Aug 22, 2019 at 14:00

answered Aug 21, 2019 at 20:48


1 Comment

excelguy Over a year ago

Thank you, it worked. Can you explain why?
