Make a new list in CSV without using pandas: return UnicodeDecodeError

Question

I am trying to make a new list in my existing csv file (not using pandas). Here is my code:

with open ('/Users/Weindependent/Desktop/dataset/albumlist.csv','r') as case0:
    reader = csv.DictReader(case0)
    album = []
    for row in reader:
        album.append(row)
print ("Number of albums is:",len(album))

The CSV file was downloaded from the Rolling Stone's Top 500 albums data set on data.world.

My logic is to create an empty list named album and have all the records in this list. But it seems the line of for row in reader has some issue.

the error message I got is:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 1040: invalid continuation byte

Can anyone let me know what I did wrong?

This can be helpful: stackoverflow.com/questions/12752313/… — rockikz
– rockikz, Commented Feb 10, 2019 at 17:40
The data file you are reading from is probably not UTF-8 encoded. You'll have to specify a different encoding with encoding="....". What encoding to use, we can't tell you, sorry. — Martijn Pieters
– Martijn Pieters, Commented Feb 10, 2019 at 17:41
Where did the dataset come from? Did they document this perhaps? — Martijn Pieters
– Martijn Pieters, Commented Feb 10, 2019 at 18:06
Since you got as far as 1040 bytes into the CSV until you hit a decoding error, chances are the file is also encoded with a single-byte encoding, the most popular of which is cp1252. Give it a try first. — blhsing
– blhsing, Commented Feb 10, 2019 at 18:12

Martijn Pieters · Accepted Answer · 2019-02-11 13:28:41Z

You need to open the file in the correct codec; UTF-8 is not the correct one. The dataset doesn't specify it, but I have determined that the most likely codec is mac_roman:

with open ('/Users/Weindependent/Desktop/dataset/albumlist.csv', 'r', encoding='mac_roman') as case0:

The original Kaggle dataset doesn't bother to document it, and the various kernels that use the set all just clobber the encoding. It's clearly a 8-bit Latin-variant (the majority of the data is ASCII with a few individual 8-bit codepoints).

So I analysed the data, and found there are just two such codepoints in 9 rows:

>>> import re
>>> eightbit = re.compile(rb'[\x80-\xff]')
>>> with open('albumlist.csv', 'rb') as bindata:
...     nonascii = [l for l in bindata if eightbit.search(l)]
...
>>> len(nonascii)
9
>>> {c for l in nonascii for c in eightbit.findall(l)}
{b'\x89', b'\xca'}

The 0x89 byte appears in just one line:

>>> sum(l.count(b'\x89') for l in nonascii)
1
>>> sum(l.count(b'\xca') for l in nonascii)
22
>>> next(l for l in nonascii if b'\x89' in l)
b'359,1972,Honky Ch\x89teau,Elton John,Rock,"Pop Rock,\xcaClassic Rock"\r\n'

That's clearly Elton John's 1972 Honky Château album, so the 0x89 byte must represent the U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX codepoint.

The 0xCA bytes all appear to represent an alternative space character, they all appear righ after commas in the genre and subgenre columns (with one album exception):

>>> import csv
>>> for row in csv.reader((l.decode('ascii', 'backslashreplace') for l in nonascii)):
...     for col in row:
...         if '\\' in col: print(col)
...
Reggae,\xcaPop,\xcaFolk, World, & Country,\xcaStage & Screen
Reggae,\xcaRoots Reggae,\xcaRocksteady,\xcaContemporary,\xcaSoundtrack
Electronic,\xcaStage & Screen
Soundtrack,\xcaDisco
Rock,\xcaBlues
Blues Rock,\xcaElectric Blues,\xcaHarmonica Blues
Garage Rock,\xcaPsychedelic Rock
Honky Ch\x89teau
Pop Rock,\xcaClassic Rock
Funk / Soul,\xcaFolk, World, & Country
Rock,\xcaPop
Stan Getz\xca/\xcaJoao Gilberto\xcafeaturing\xcaAntonio Carlos Jobim
Bossa Nova,\xcaLatin Jazz
Lo-Fi,\xcaIndie Rock

These 0xCA bytes are almost certainly representing the U+00A0 NO-BREAK SPACE codepoint.

With these two mappings, you can try to determine what 8-bit codecs would make the same mapping. Rather than manually try out all Python's codecs I used Tripleee's 8-bit codec mapping to see what codecs use these mappings. There are only two:

0x89

â‎ (U+00E2): mac_arabic, mac_croatian, mac_farsi, mac_greek, mac_iceland, mac_roman, mac_romanian, mac_turkish

0xca

‎ (U+00A0): mac_centeuro, mac_croatian, mac_cyrillic, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish

There are 6 encodings that are listed in both sets:

>>> set1 = set('mac_arabic, mac_croatian, mac_farsi, mac_greek, mac_iceland, mac_roman, mac_romanian, mac_turkish'.split(', '))
>>> set2 = set('mac_centeuro, mac_croatian, mac_cyrillic, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish'.split(', '))
>>> set1 & set2
{'mac_turkish', 'mac_iceland', 'mac_romanian', 'mac_greek', 'mac_croatian', 'mac_roman'}

Of these, Mac OS Roman mac_roman codec is probably the most likely to have been used as Microsoft Excel for Mac used Mac Roman to create CSV files for a long time. However, it doesn't really matter, any of those 6 would work here.

You may want to replace those U+00A0 non-breaking spaces if you want to split out the genre and subgenre columns (really the genre and style columns if these were taken from Discogs).

Collectives™ on Stack Overflow

Make a new list in CSV without using pandas: return UnicodeDecodeError

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related