Suddenly a UnicodeDecodeError arises in code of mine that worked yesterday.

  File "D:\Anaconda\lib\site-packages\IPython\core\interactiveshell.py", line 3284, in run_code
    self.showtraceback(running_compiled_code=True)
  File "D:\Anaconda\lib\site-packages\IPython\core\interactiveshell.py", line 2021, in showtraceback
    value, tb, tb_offset=tb_offset)
  File "D:\Anaconda\lib\site-packages\IPython\core\ultratb.py", line 1379, in structured_traceback
    self, etype, value, tb, tb_offset, number_of_lines_of_context)
  File "D:\Anaconda\lib\site-packages\IPython\core\ultratb.py", line 1291, in structured_traceback
    elist = self._extract_tb(tb)
  File "D:\Anaconda\lib\site-packages\IPython\core\ultratb.py", line 1272, in _extract_tb
    return traceback.extract_tb(tb)
  File "D:\Anaconda\lib\traceback.py", line 72, in extract_tb
    return StackSummary.extract(walk_tb(tb), limit=limit)
  File "D:\Anaconda\lib\traceback.py", line 364, in extract
    f.line
  File "D:\Anaconda\lib\traceback.py", line 286, in line
    self._line = linecache.getline(self.filename, self.lineno).strip()
  File "D:\Anaconda\lib\linecache.py", line 16, in getline
    lines = getlines(filename, module_globals)
  File "D:\Anaconda\lib\linecache.py", line 47, in getlines
    return updatecache(filename, module_globals)
  File "D:\Anaconda\lib\linecache.py", line 137, in updatecache
    lines = fp.readlines()
  File "D:\Anaconda\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 2441: invalid start byte

import csv
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

dateiname_TDM = "./TDM_example_small.csv" 
dateiname_corpus = "./Topic_Modeling/Input_Data/corpus.mm" 
dateiname_dictionary = "./Topic_Modeling/Input_Data/dictionary.dict"

ids = {}
corpus = []

with open(dateiname_TDM, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quotechar='|')
    documente = next(reader, None)[1:]  # header row: the document names
    for rownumber, row in enumerate(reader):
        for index, field in enumerate(row):
            if index == 0:
                if rownumber > 0:
                    ids[rownumber-1] = field  # first column holds the term
            else:
                if rownumber == 0:
                    corpus.append([])  # one (term_id, count) list per document
                else:
                    try:
                        # field is a str; compare its integer value
                        # (field > 0 would raise TypeError in Python 3)
                        if int(field) > 0:
                            corpus[index-1].append((rownumber-1, int(field)))
                    except ValueError:
                        corpus[index-1].append((rownumber-1, 0))
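For reference, the failing byte can be reproduced in isolation. This sketch uses a hypothetical throwaway file (not the actual TDM_example_small.csv) containing the raw byte 0xf6, as a CP1252-encoded export might:

```python
import csv
import os
import tempfile

# Write a tiny CSV containing the raw byte 0xf6.
path = os.path.join(tempfile.gettempdir(), "tdm_repro.csv")
with open(path, "wb") as f:
    f.write(b"term;doc1\nf\xf6rdern;2\n")

# Reading it as UTF-8 fails with the same error as above:
try:
    with open(path, newline="", encoding="utf-8") as csvfile:
        list(csv.reader(csvfile, delimiter=";"))
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xf6 ...

# Reading it with the matching 8-bit encoding succeeds:
with open(path, newline="", encoding="cp1252") as csvfile:
    rows = list(csv.reader(csvfile, delimiter=";"))
print(rows)  # [['term', 'doc1'], ['fördern', '2']]
```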
  • If your code hasn't changed since yesterday, maybe the input data has. Apparently the CSV file you're reading now wasn't encoded in UTF-8, but probably in some 8-bit character set (e.g. CP1252). Please also have a look at this post, explaining why this on its own is not quite enough information (we need to know what you think byte 0xf6 should be interpreted as – "ö" maybe?). Commented May 15, 2019 at 21:47
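The comment's guess is easy to check: in both CP1252 and Latin-1, byte 0xf6 does decode to "ö":

```python
# Hypothesis check: what byte 0xf6 means in the usual 8-bit suspects.
assert b"\xf6".decode("cp1252") == "ö"   # Windows-1252
assert b"\xf6".decode("latin-1") == "ö"  # ISO-8859-1
print(b"\xf6".decode("cp1252"))  # ö
```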

1 Answer


Without seeing what's at position 2441 I'm not entirely sure, but it is probably one of the following:

  • A special non-ASCII/extended-ASCII character. In that case, encode the string explicitly with the_string.encode("UTF-8"), or pass encoding="UTF-8" to the open() call.
  • You have \u or \U somewhere, which makes the characters after it be read as a Unicode escape sequence; use repr(the_string) (or a raw string) so the backslashes are escaped rather than interpreted. (Probably not this one.)
  • You are reading a bytes object, not a str object. Try opening the file with mode "r+b" (read and write, binary) in the open() call.

I've more or less thrown spaghetti at a wall but I hope this helps!
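The options above can be sketched like this, using a hypothetical throwaway file (the real file's encoding would still need to be confirmed):

```python
import os
import tempfile

# Create a throwaway file containing the problem byte 0xf6.
path = os.path.join(tempfile.gettempdir(), "decode_options_demo.csv")
with open(path, "wb") as f:
    f.write(b"k\xf6ln;3\n")

# Option 1: tell open() the actual encoding, if known.
with open(path, encoding="cp1252") as f:
    print(f.read())  # köln;3

# Option 2: keep UTF-8 but replace undecodable bytes with U+FFFD.
with open(path, encoding="utf-8", errors="replace") as f:
    print(f.read())  # k�ln;3

# Option 3: open in binary mode and inspect the raw bytes.
with open(path, "rb") as f:
    print(f.read())  # b'k\xf6ln;3\n'
```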
