How to detect the right file encoding with Python?

Question

I am trying to read a CSV file (from https://openwrt.org/_media/toh_dump_tab_separated.zip) in Python with pandas.read_csv(). The problem is the file's encoding: it is neither UTF-8 nor Latin1, and I don't want to go through all the codecs (https://docs.python.org/3/library/codecs.html#standard-encodings) by hand.

My current workaround is to open the file in LibreOffice, replace the weird characters with '-', save it as Latin1, and then open it in Python.

How can I do this in Python alone?

The following code and error are my current status with UTF-8:

import pandas as pd
df = pd.read_csv('../../Downloads/toh_dump_tab_separated/ToH_dump_tab_separated.csv', encoding = 'utf-8')

(...)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 983: invalid start byte

and with Latin1:

import pandas as pd
df = pd.read_csv('../../Downloads/toh_dump_tab_separated/ToH_dump_tab_separated.csv', encoding = 'Latin1')

(...)

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

Encoding appears to be cp1252.

– Mark Tolonen, Commented Dec 16, 2020 at 17:31

JosefZ · Accepted Answer · 2020-12-17 17:11:23Z

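Use the sep parameter, because the file is tab-separated:

import pandas as pd
# cp1252 handles the non-ASCII bytes; sep='\t' splits the columns on tabs
df = pd.read_csv('ToH_dump_tab_separated.csv', encoding='cp1252', sep='\t')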
print(df)

          pid  ...                                           comments
0       16132  ...                                                NaN
1       16133  ...                                                NaN
2       16134  ...                                                NaN
3       16135  ...                           Clone of Aztech HW550-3G
4       16137  ...  Image build disabled in master with commit d7d...
...       ...  ...                                                ...
1759  9726386  ...                                                NaN
1760  9878711  ...  Rough edges as of December 2020. Realtek targe...
1761  9912125  ...  Works with WL-WN575A3 image according OpenWrt ...
1762  9927580  ...                                                NaN
1763  9946488  ...                                                NaN

[1764 rows x 67 columns]
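
The Latin1 attempt in the question most likely failed for a reason unrelated to encoding: pandas.read_csv() splits on commas by default, while this file is tab-separated. If the delimiter is not known in advance, the standard library's csv.Sniffer can guess it from a sample of the text. A minimal sketch, where the sample rows are made up for illustration and are not taken from the real file:

```python
import csv
import io

# Hypothetical sample: three tab-separated rows, one containing a comma.
sample = (
    "pid\tbrand\tmodel\n"
    "1\tAcme\tRouter A\n"
    "2\tAcme\tRouter X, rev B\n"
)

# Restrict the candidates so that letters are never considered delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters="\t,;")
detected = dialect.delimiter

# Parse the sample with the detected dialect to confirm the column split.
rows = list(csv.reader(io.StringIO(sample), dialect))

print(repr(detected))
print(rows[2])
```

The detected delimiter can then be passed to pandas.read_csv() through its sep parameter.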

FYI, the weird byte 0xbf decodes to ¿ (INVERTED QUESTION MARK, U+00BF, written '\u00BF' in Python):

print(df.switch[:2])
print(df.fccid[-2:])

0    Infineon ADM6996I
1                    ¿
Name: switch, dtype: object
1762                    http://¿
1763    https://fcc.io/Q87-03331
Name: fccid, dtype: object
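
The behaviour of that byte under the codecs mentioned in this thread can be reproduced without the file. A small sketch using only built-in codecs:

```python
raw = b"\xbf"

# UTF-8: 0xbf is a continuation byte, so it cannot start a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
    utf8_reason = None
except UnicodeDecodeError as exc:
    utf8_ok = False
    utf8_reason = exc.reason

# cp1252 and latin1 both map 0xbf to U+00BF, so this byte alone
# cannot tell the two single-byte codecs apart.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin1")

print(utf8_ok, utf8_reason)
print(as_cp1252, as_latin1)
```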

Edit (thanks to Mark Tolonen): the encoding appears to be cp1252. There are smart quotes in some of the fields:

print(df.comments[254][288:])

Ignore the “HW v” on the label - it may not say 2 for v2 hardware
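
The smart quotes are what separate cp1252 from Latin1 here. A short sketch, assuming the quotes are stored as the bytes 0x93 and 0x94 (their cp1252 values), suggests why Latin1 is the wrong choice even though it never raises a decoding error: it turns the same bytes into invisible C1 control characters.

```python
import unicodedata

# Assumed byte sequence for the quoted fragment, as cp1252 would store it.
raw = b"\x93HW v\x94"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin1")

def control_chars(text):
    """Return the characters in text whose Unicode category is Cc (control)."""
    return [ch for ch in text if unicodedata.category(ch) == "Cc"]

cp1252_controls = control_chars(as_cp1252)
latin1_controls = control_chars(as_latin1)

print(repr(as_cp1252), cp1252_controls)
print(repr(as_latin1), latin1_controls)
```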

edited Dec 17, 2020 at 17:11

answered Dec 16, 2020 at 17:16

JosefZ


4 Comments

Mark Tolonen Over a year ago

Encoding appears to be cp1252. There are smart quotes in some of the fields.

Cyoux Over a year ago

Thanks for the help! But how do you know that it is cp1252, @MarkTolonen? Did you just roll the magic codec dice, or did you look closely at the characters and rely on good knowledge of codecs?

Mark Tolonen Over a year ago

@Cyoux Option 2. I loaded the data, built a set of its characters, subtracted the set of ASCII characters, and was left with a few French accented letters and smart quotes when the file was opened as cp1252. latin1 doesn’t support smart quotes. cp1252 is a common codec on US and Western European Windows.

JosefZ Over a year ago

@Cyoux Read this thread: What is the exact difference between Windows-1252(1/3/4) and ISO-8859-1?
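
The check Mark Tolonen describes in the comments above can be sketched roughly as follows: decode the raw bytes with a candidate codec, subtract the ASCII characters, and inspect what is left. The function name and the sample bytes are illustrative assumptions, not his actual code:

```python
ASCII = {chr(i) for i in range(128)}

def leftover_chars(raw: bytes, encoding: str) -> set:
    """Return the non-ASCII characters obtained by decoding raw with encoding."""
    return set(raw.decode(encoding)) - ASCII

# Hypothetical sample bytes; for a real file one would read it in
# binary mode instead, e.g. raw = open(path, "rb").read()
raw = "Ignore the \u201cHW v\u201d label, caf\u00e9".encode("cp1252")

leftovers = {codec: leftover_chars(raw, codec) for codec in ("cp1252", "latin1")}

for codec, chars in leftovers.items():
    # repr() makes invisible control characters show up as escapes.
    print(codec, sorted(repr(ch) for ch in chars))
```

A candidate whose leftovers are readable letters and punctuation is plausible. One that leaves control characters, or that raises UnicodeDecodeError, probably is not.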
