I am trying to read a CSV file (from https://openwrt.org/_media/toh_dump_tab_separated.zip) in Python with pandas.read_csv(). The problem is the file's encoding: it is not UTF-8, and Latin1 does not seem to work either. I don't want to try every codec in the list of standard encodings (https://docs.python.org/3/library/codecs.html#standard-encodings) by hand.

My current workaround is to open the file in LibreOffice, replace the odd characters with '-', save it as Latin1, and then open it in Python.

How do I do it in Python only?

The following code and error show my current attempt with UTF-8:

import pandas as pd
df = pd.read_csv('../../Downloads/toh_dump_tab_separated/ToH_dump_tab_separated.csv', encoding = 'utf-8')

(...)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 983: invalid start byte

and with Latin1:

import pandas as pd
df = pd.read_csv('../../Downloads/toh_dump_tab_separated/ToH_dump_tab_separated.csv', encoding = 'Latin1')

(...)

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

  • Encoding appears to be cp1252. Commented Dec 16, 2020 at 17:31

1 Answer

Use the sep parameter, because the file is tab-separated, together with the cp1252 encoding:

import pandas as pd
df = pd.read_csv('ToH_dump_tab_separated.csv', encoding='cp1252', sep='\t')
print(df)
          pid  ...                                           comments
0       16132  ...                                                NaN
1       16133  ...                                                NaN
2       16134  ...                                                NaN
3       16135  ...                           Clone of Aztech HW550-3G
4       16137  ...  Image build disabled in master with commit d7d...
...       ...  ...                                                ...
1759  9726386  ...                                                NaN
1760  9878711  ...  Rough edges as of December 2020. Realtek targe...
1761  9912125  ...  Works with WL-WN575A3 image according OpenWrt ...
1762  9927580  ...                                                NaN
1763  9946488  ...                                                NaN

[1764 rows x 67 columns]
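The ParserError seen with Latin1 in the question is a delimiter problem rather than an encoding problem: read_csv splits on commas by default, while this file is tab-separated. The following is a minimal sketch of that effect using only the standard csv module. The sample rows and the helper name field_counts are made up for illustration and are not taken from the actual file.

```python
import csv
import io

# Hypothetical tab-separated sample; the real file has 67 columns.
sample = "pid\tbrand\tcomments\n1\tAcme\tNaN\n2\tAcme\tWorks, mostly\n"

def field_counts(text, delimiter):
    """Return the number of fields the csv module sees on each line."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader]

# Splitting on commas: the header looks like one field, and a later
# line that happens to contain a comma suddenly has two.
comma_counts = field_counts(sample, ",")

# Splitting on tabs: every line has the same number of fields.
tab_counts = field_counts(sample, "\t")

print(comma_counts)
print(tab_counts)
```

An inconsistent field count per line with one delimiter, and a consistent count with another, is a strong hint about which separator the file really uses.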

FYI, the odd byte 0xbf is ¿ (INVERTED QUESTION MARK, U+00BF, written \u00BF in Python):

print(df.switch[:2])
print(df.fccid[-2:])
0    Infineon ADM6996I
1                    ¿
Name: switch, dtype: object
1762                    http://¿
1763    https://fcc.io/Q87-03331
Name: fccid, dtype: object

Edit (thanks to Mark Tolonen): the encoding appears to be cp1252. There are smart quotes in some of the fields:

print(df.comments[254][288:])
Ignore the “HW v” on the label - it may not say 2 for v2 hardware
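The difference between the codecs can be checked at the byte level. This is a small sketch, assuming the left smart quote is stored as byte 0x93, which is where cp1252 places it; the variable names are only illustrative.

```python
# Byte from the UnicodeDecodeError in the question.
raw_question_mark = b"\xbf"
# Assumed byte for the left smart quote (its cp1252 position).
raw_smart_quote = b"\x93"

# 0xbf is a UTF-8 continuation byte, so it cannot start a character.
try:
    raw_question_mark.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Both single-byte codecs agree on 0xbf.
qm_cp1252 = raw_question_mark.decode("cp1252")
qm_latin1 = raw_question_mark.decode("latin1")

# They disagree on 0x93: cp1252 yields a printable quote,
# latin1 yields an invisible C1 control character.
quote_cp1252 = raw_smart_quote.decode("cp1252")
quote_latin1 = raw_smart_quote.decode("latin1")

print(utf8_ok, qm_cp1252, qm_latin1)
print(repr(quote_cp1252), repr(quote_latin1))
```

Because latin1 maps every byte to some code point, decoding with it never raises an error, so a successful decode alone does not show that latin1 is the right codec.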

4 Comments

Encoding appears to be cp1252. There are smart quotes in some of the fields.
Thanks for the help! But how do you know that it is cp1252, @MarkTolonen? Was it just a roll of the magic codec dice, or did you look closely at the characters and draw on a good knowledge of codecs?
@Cyoux Option 2. I loaded the data, built the set of its characters minus the set of ASCII characters, and was left with a few French accents and smart quotes when the file was opened as cp1252. latin1 doesn't support smart quotes. cp1252 is a common codec on US and Western European Windows systems.
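The check described in this comment can be sketched roughly as follows. This is not the commenter's actual code: the helper names non_ascii_chars and has_c1_controls are hypothetical, and the sample bytes stand in for the raw file contents.

```python
def non_ascii_chars(raw: bytes, encoding: str) -> set:
    """Decode raw bytes and return the set of characters outside ASCII."""
    return {ch for ch in raw.decode(encoding) if ord(ch) > 127}

def has_c1_controls(chars) -> bool:
    """True if any character is an invisible C1 control (U+0080 to U+009F)."""
    return any(0x80 <= ord(ch) <= 0x9F for ch in chars)

# Hypothetical sample bytes: an accented letter, a pair of smart
# quotes, and the 0xbf byte from the question.
sample = b"caf\xe9 \x93HW v\x94 \xbf"

for encoding in ("cp1252", "latin1"):
    leftovers = non_ascii_chars(sample, encoding)
    # Printable leftovers suggest a plausible codec; C1 controls
    # suggest the bytes were meant for a different one.
    print(encoding, sorted(leftovers), has_c1_controls(leftovers))
```

To try this on the real file, the bytes could be read with open('ToH_dump_tab_separated.csv', 'rb').read() and passed in place of sample. Note that Python's cp1252 codec raises UnicodeDecodeError on the few byte values it leaves undefined, which is itself a useful signal.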
