UnicodeDecodeError when reading file only on unix system

Question

I'm trying to read a (large) text file using Python 3.7. I'm simply doing:

with open(filename, 'r') as f:
    for il, l in enumerate(f, il):
        ...  # do things

This works perfectly if I run the script from Spyder's IPython console on Windows.

However, if I run the exact same script to read the exact same file (not a copy!) from a Unix server, I get the following error:

  File "/net/atgcls01/data2/j02660606/code/freeGSA.py", line 127, in read_gwa
    for il,l in enumerate(f,il):
  File "/u/lon/lamerio/.conda/envs/la3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 2099: invalid start byte

I tried to find the culprit to understand what is going on. I did:

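# Read the first 3000 bytes one at a time and record (offset, byte, hex).
# The list is not called "bytes" so that the builtin type is not shadowed.
byte_log = []
with open(settings['GSA_file'], 'rb') as fobj:
    for i in range(3000):
        b = fobj.read(1)
        byte_log.append((i, b, b.hex()))

byte_log[2095:2105]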

The output is

[(2095, b'0', '30'), (2096, b'0', '30'), (2097, b' ', '20'), (2098, b't', '74'), (2099, b'o', '6f'), (2100, b' ', '20'), (2101, b'5', '35'), (2102, b'6', '36'), (2103, b'1', '31'), (2104, b' ', '20')]

I don't see any 0xb0 byte at position 2099. Indeed, position 2098 is 0x74, position 2099 is 0x6f and position 2100 is 0x20. These translate to the valid UTF-8 characters 't', 'o', ' ' (space), which are indeed at that position in the file.

How can I solve this error, and why does it arise only on the Unix machine?

EDIT: Running

import sys
sys.getdefaultencoding()

returns 'utf-8' on both systems.

PS: On Windows I have version 3.7.5, while on Unix I have 3.7.4.

Python 3.7. Thanks for pointing out that I didn't specify it; I've added it to the question.
– Luca, Commented Nov 18, 2019 at 11:41
The reason you don't see the byte at 2099 is probably that there are other (valid) multi-byte characters earlier in the file. When the file is read as text, these are decoded correctly and each appears as a single character. When it is read in binary mode, each takes up more than one byte, shifting the invalid byte later in the input. You could run fobj.read().find(b'\xb0') to locate the troublesome byte.
– IronFarm, Commented Nov 18, 2019 at 11:53
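
Following up on the suggestion above, here is a minimal sketch that lists every offset of the byte 0xb0 in a file without loading it all at once. The helper name `find_byte_offsets` and the chunk size are illustrative assumptions, not part of the original script. One more possible explanation for the mismatch: the traceback comes from the incremental decoder in codecs.py, so the reported position appears to be relative to the chunk of bytes being decoded at that moment, not to the start of the file. In a large file, byte 2099 of the whole file can therefore look perfectly innocent.

```python
def find_byte_offsets(path, needle=b"\xb0", chunk_size=1 << 20):
    """Return the absolute file offsets of every occurrence of a single byte.

    `needle` must be exactly one byte, so a match can never straddle
    two chunks and no overlap handling is needed.
    """
    if len(needle) != 1:
        raise ValueError("needle must be a single byte")
    offsets = []
    base = 0  # absolute offset of the start of the current chunk
    with open(path, "rb") as fobj:
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                break
            start = 0
            while True:
                idx = chunk.find(needle, start)
                if idx == -1:
                    break
                offsets.append(base + idx)
                start = idx + 1
            base += len(chunk)
    return offsets
```

Calling `find_byte_offsets(settings['GSA_file'])` should give the real positions of the offending byte, which can then be inspected in binary mode as in the question.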

TomMP · Accepted Answer · 2019-11-18 11:46:52Z

1

On the unix machine, try

with open(filename, encoding='latin-1') as f: ...

or

with open(filename, encoding='windows-1252') as f: ...

Edit: Windows (usually) has a different default encoding than UNIX. I assume you edited or created the files on your Windows machine. You can also open one of those files in Notepad, which I believe shows the encoding in the bottom right corner. I might be wrong about this, as I'm recalling it from memory. In any case, that's the encoding you want to specify on your UNIX machine. But go ahead and try the two encodings I have specified.
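
If the actual encoding is unknown, one possible approach is to try strict UTF-8 first and fall back to a Windows code page. The sketch below is only an illustration under that assumption; the function name `read_text_with_fallback` and the candidate list are made up for this example. Note that latin-1 maps every possible byte to a character, so it never fails: it works as a last resort, not as a way to detect the encoding.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode a file with the first candidate encoding that does not fail.

    Returns (text, encoding_used). The whole file is read into memory,
    which may not be suitable for very large files.
    """
    with open(path, "rb") as fobj:
        data = fobj.read()
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode " + repr(path))
```

For line-by-line processing of a large file, the encoding found this way (on a sample, for instance) can then be passed explicitly to `open(filename, encoding=...)`.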


2 Comments

Luca Over a year ago

Even though sys.getdefaultencoding() returned 'utf-8', when I changed the call to open(..., encoding='utf-8') the error could be reproduced on Windows as well. Changing it to encoding='latin-1' solved the issue. It is still not clear to me why, if sys.getdefaultencoding() is 'utf-8', open can read the file when no encoding is specified but raises an error with encoding='utf-8', which should be the default anyway.

TomMP Over a year ago

Sometimes encodings are weird, and could take paragraphs to elaborate on. In my personal experience, trying the above-mentioned encodings when such an error occurs works in about 90% of cases. As to why Windows claims to have a default encoding of utf-8, or what exactly is going on under the hood when it seemingly switches encodings, I can't explain that either. Glad it works now though!
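
On the question left open in these comments: as far as I know, `sys.getdefaultencoding()` is not what `open()` consults. When no `encoding` argument is given, `open()` in Python 3.7 uses `locale.getpreferredencoding(False)`, which is typically a legacy code page such as cp1252 on Windows and often UTF-8 on Unix. That would explain why the same call behaves differently on the two machines. Below is a small sketch that compares the two values and shows how the byte from the traceback decodes; the helper name `describe_encodings` is made up for illustration.

```python
import locale
import sys


def describe_encodings():
    """Collect the two 'default' encodings that are easy to confuse."""
    return {
        # Default for str.encode() / bytes.decode(); 'utf-8' in Python 3.
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        # Used by open() when no encoding argument is passed.
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
    }


print(describe_encodings())

raw = b"\xb0"
# In both latin-1 and cp1252 this byte is the degree sign.
print(raw.decode("latin-1"))
print(raw.decode("cp1252"))

# As UTF-8, a lone 0xb0 is a continuation byte and cannot start a character.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)
```

Running this on both machines should show whether the second value differs between them.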

Grzegorz Bokota · Accepted Answer · 2019-11-18 11:46:35Z

1

The problem may be the default encoding. On Windows it may not be UTF-8 but some Windows code page. In Poland the default Windows encoding is cp1250, so code like this will work:

with open(filename, 'r', encoding="cp1250") as f:
    for il, l in enumerate(f, il):
        ...  # do things


Ali k. · Accepted Answer · 2019-11-18 11:44:45Z

0

It's a Unicode character; you can use the "unidecode" module to decode it. It will work great.

You can read more about it here: https://pypi.org/project/Unidecode/


3 Comments

Luca Over a year ago

But why don't I get the error on Windows? Also, Unicode character 0xb0 is the "°" symbol, which I definitely don't have in my file.

TomMP Over a year ago

I wouldn't go straight to another package, as that is one more dependency you are dealing with. The error stems from UNIX and Windows having different default encodings. Maybe read up on encodings in the first place, if you want to know where the error stems from (which is a great mindset to have).

Ali k. Over a year ago

I faced the same error on Windows and tried to encode the file as UTF-8, but that did not work for me; this package solved my issue.
