Pandas read_csv() raises a UnicodeDecodeError on some specific rows.
If I use nrows=n1 it works without any error, but when I use nrows=n2 (> n1) it raises
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 12: invalid start byte
It worked fine before, but at some point it started raising this error consistently. Sometimes it works again after I reboot the computer, but only on the first call.
I tried read_csv both with and without the encoding option. I also tried error_bad_lines=False.
This is driving me crazy. Any ideas? Even if this turns out to be a system issue, I would at least like to know how to find the row number of the problematic row.
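One way to find the problematic row without pandas is to read the file in binary mode and try decoding each line yourself. This is a minimal sketch; find_bad_rows is a hypothetical helper name, and the path you pass in would be your exported CSV:

```python
def find_bad_rows(path, encoding="utf-8"):
    """Return (row_number, error) pairs for lines that fail to decode."""
    bad = []
    with open(path, "rb") as f:  # binary mode: no decoding happens on read
        for i, line in enumerate(f, start=1):
            try:
                line.decode(encoding)
            except UnicodeDecodeError as e:
                bad.append((i, e))
    return bad
```

Calling find_bad_rows("table.csv") would then list every row whose bytes are not valid UTF-8, along with the exact error for each.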
(I exported the table from MATLAB with the encoding specified as utf-8; I also tried CP949, which is my system's default encoding. Importing from SAS was successful.)
You can pass an explicit encoding to read_csv. To find out which one your file actually uses, try chardet.detect, or any text editor able to read your file and tell you what encoding it uses, or one of the many online tools that let you detect your encoding... As a last resort, encoding='latin1' in read_csv will accept any byte ;)
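The suggestion above can be sketched as a small helper. This assumes the third-party chardet package is installed (pip install chardet); read_csv_detecting_encoding is a hypothetical name, and sampling only the first chunk of the file keeps detection fast on large exports:

```python
import chardet
import pandas as pd

def read_csv_detecting_encoding(path, sample_bytes=100_000):
    """Guess the file's encoding with chardet, then load it with pandas.

    Falls back to latin1, which maps every possible byte to a character,
    so read_csv never raises a UnicodeDecodeError (though characters may
    be wrong if the guess is bad).
    """
    with open(path, "rb") as f:
        guess = chardet.detect(f.read(sample_bytes))
    enc = guess["encoding"] or "latin1"
    return pd.read_csv(path, encoding=enc), enc
```

Since the error mentions byte 0xb0, which is a valid lead byte in CP949/EUC-KR, it is plausible the file was actually written in the system's default Korean encoding rather than UTF-8, and detection may report exactly that.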