Suddenly a UnicodeDecodeError arises in code of mine that worked yesterday.

  File "D:\Anaconda\lib\site-packages\IPython\core\interactiveshell.py", line 3284, in run_code
    self.showtraceback(running_compiled_code=True)
  File "D:\Anaconda\lib\site-packages\IPython\core\interactiveshell.py", line 2021, in showtraceback
    value, tb, tb_offset=tb_offset)
  File "D:\Anaconda\lib\site-packages\IPython\core\ultratb.py", line 1379, in structured_traceback
    self, etype, value, tb, tb_offset, number_of_lines_of_context)
  File "D:\Anaconda\lib\site-packages\IPython\core\ultratb.py", line 1291, in structured_traceback
    elist = self._extract_tb(tb)
  File "D:\Anaconda\lib\site-packages\IPython\core\ultratb.py", line 1272, in _extract_tb
    return traceback.extract_tb(tb)
  File "D:\Anaconda\lib\traceback.py", line 72, in extract_tb
    return StackSummary.extract(walk_tb(tb), limit=limit)
  File "D:\Anaconda\lib\traceback.py", line 364, in extract
    f.line
  File "D:\Anaconda\lib\traceback.py", line 286, in line
    self._line = linecache.getline(self.filename, self.lineno).strip()
  File "D:\Anaconda\lib\linecache.py", line 16, in getline
    lines = getlines(filename, module_globals)
  File "D:\Anaconda\lib\linecache.py", line 47, in getlines
    return updatecache(filename, module_globals)
  File "D:\Anaconda\lib\linecache.py", line 137, in updatecache
    lines = fp.readlines()
  File "D:\Anaconda\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 2441: invalid start byte

import csv
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

dateiname_TDM = "./TDM_example_small.csv" 
dateiname_corpus = "./Topic_Modeling/Input_Data/corpus.mm" 
dateiname_dictionary = "./Topic_Modeling/Input_Data/dictionary.dict"

ids = {}
corpus = []

with open(dateiname_TDM, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quotechar='|')
    documente = next(reader, None)[1:]  # header row: the document names
    for rownumber, row in enumerate(reader):
        for index, field in enumerate(row):
            if index == 0:
                if rownumber > 0:
                    ids[rownumber-1] = field  # first column holds the term
            else:
                if rownumber == 0:
                    corpus.append([])  # one (term_id, count) list per document
                else:
                    try:
                        # field is a str; compare its integer value
                        # (field > 0 would raise TypeError in Python 3)
                        if int(field) > 0:
                            corpus[index-1].append((rownumber-1, int(field)))
                    except ValueError:
                        corpus[index-1].append((rownumber-1, 0))
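For reference, the failing byte can be reproduced in isolation. This sketch uses a hypothetical throwaway file (not the actual TDM_example_small.csv) containing the raw byte 0xf6, as a CP1252-encoded export might:

```python
import csv
import os
import tempfile

# Write a tiny CSV containing the raw byte 0xf6.
path = os.path.join(tempfile.gettempdir(), "tdm_repro.csv")
with open(path, "wb") as f:
    f.write(b"term;doc1\nf\xf6rdern;2\n")

# Reading it as UTF-8 fails with the same error as above:
try:
    with open(path, newline="", encoding="utf-8") as csvfile:
        list(csv.reader(csvfile, delimiter=";"))
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xf6 ...

# Reading it with the matching 8-bit encoding succeeds:
with open(path, newline="", encoding="cp1252") as csvfile:
    rows = list(csv.reader(csvfile, delimiter=";"))
print(rows)  # [['term', 'doc1'], ['fördern', '2']]
```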
  • If your code hasn't changed since yesterday, maybe the input data has. Apparently the CSV file you're reading now wasn't encoded in UTF-8, but probably in some 8-bit character set (e.g. CP1252). Please also have a look at this post, explaining why this on its own is not quite enough information (we need to know what you think byte 0xf6 should be interpreted as – "ö" maybe?). Commented May 15, 2019 at 21:47
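The comment's guess is easy to check: in both CP1252 and Latin-1, byte 0xf6 does decode to "ö":

```python
# Hypothesis check: what byte 0xf6 means in the usual 8-bit suspects.
assert b"\xf6".decode("cp1252") == "ö"   # Windows-1252
assert b"\xf6".decode("latin-1") == "ö"  # ISO-8859-1
print(b"\xf6".decode("cp1252"))  # ö
```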

1 Answer


Without seeing what's at position 2441 I'm not entirely sure, but it is probably one of the following:

  • A special non-ASCII/extended-ASCII character. In that case, encode the string explicitly with the_string.encode("UTF-8"), or pass encoding="UTF-8" to the open() call.
  • You have \u or \U somewhere, which makes the characters after it be read as a Unicode escape sequence; use repr(the_string) (or a raw string) so the backslashes are escaped rather than interpreted. (Probably not this one.)
  • You are reading a bytes object, not a str object. Try opening the file with mode "r+b" (read and write, binary) in the open() call.

I've more or less thrown spaghetti at a wall but I hope this helps!
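The options above can be sketched like this, using a hypothetical throwaway file (the real file's encoding would still need to be confirmed):

```python
import os
import tempfile

# Create a throwaway file containing the problem byte 0xf6.
path = os.path.join(tempfile.gettempdir(), "decode_options_demo.csv")
with open(path, "wb") as f:
    f.write(b"k\xf6ln;3\n")

# Option 1: tell open() the actual encoding, if known.
with open(path, encoding="cp1252") as f:
    print(f.read())  # köln;3

# Option 2: keep UTF-8 but replace undecodable bytes with U+FFFD.
with open(path, encoding="utf-8", errors="replace") as f:
    print(f.read())  # k�ln;3

# Option 3: open in binary mode and inspect the raw bytes.
with open(path, "rb") as f:
    print(f.read())  # b'k\xf6ln;3\n'
```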
