I'm trying to read a (large) text file using Python 3.7. I'm simply doing:
with open(filename, 'r') as f:
    for il, l in enumerate(f, il):
        pass  # do things
This works perfectly if I run the script from Spyder's IPython console on Windows.
However, if I run the exact same script to read the exact same file (not a copy!) from a Unix server, I get the following error:
  File "/net/atgcls01/data2/j02660606/code/freeGSA.py", line 127, in read_gwa
    for il,l in enumerate(f,il):
  File "/u/lon/lamerio/.conda/envs/la3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 2099: invalid start byte
I tried to find the culprit to understand what is going on. I did:
bytes = []
fobj = open(settings['GSA_file'], 'rb')
for i in range(3000):
    b = fobj.read(1)
    bytes.append((i, b, b.hex()))
fobj.close()
bytes[2095:2105]
the output is
[(2095, b'0', '30'), (2096, b'0', '30'), (2097, b' ', '20'), (2098, b't', '74'), (2099, b'o', '6f'), (2100, b' ', '20'), (2101, b'5', '35'), (2102, b'6', '36'), (2103, b'1', '31'), (2104, b' ', '20')]
I don't see any 0xb0 byte at position 2099. Position 2098 is 0x74, position 2099 is 0x6f and position 2100 is 0x20. These translate to the valid UTF-8 characters 't', 'o', ' ' (space), which are indeed what the file contains around position 2099.
How can I solve this error, and why does it arise only on the Unix machine?
EDIT: Running
import sys
sys.getdefaultencoding()
returns 'utf-8' on both systems.
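As a side note, sys.getdefaultencoding() is not the value open() falls back on in text mode; in Python 3.7, when no encoding argument is given, open() uses locale.getpreferredencoding(False), which can differ between a Windows and a Unix machine. A minimal sketch for comparing the two values on each system (the helper name default_encodings is made up for illustration):

```python
import locale
import sys


def default_encodings():
    """Return (str default encoding, encoding open() uses in text mode).

    Hypothetical helper, only for comparing the two settings side by side.
    """
    # sys.getdefaultencoding() is the default for str.encode()/bytes.decode().
    # locale.getpreferredencoding(False) is what open() uses in Python 3.7
    # when called without an explicit encoding argument.
    return sys.getdefaultencoding(), locale.getpreferredencoding(False)


str_default, open_default = default_encodings()
print("str default:   ", str_default)
print("open() default:", open_default)
```

If the second value differs between the two machines, that would be consistent with the same file decoding on one and failing on the other.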
PS: On Windows I have version 3.7.5, while on Unix I have 3.7.4.
Use fobj.read().find(b'\xb0'), with the file opened in binary mode, to locate the troublesome character.
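The search suggested above can be sketched as follows. This is only an illustration under stated assumptions: find_byte is a made-up helper name, and the demo writes a small throwaway file instead of touching the real data. The position in the error message is probably relative to the chunk handed to the decoder rather than to the start of the file, which would explain why the bytes around file offset 2099 look fine.

```python
import os
import tempfile


def find_byte(path, needle=b"\xb0"):
    """Return the absolute file offset of the first `needle`, or -1 if absent.

    Hypothetical helper; it reads the whole file into memory, so a very
    large file may need a chunked search instead.
    """
    with open(path, "rb") as fobj:  # binary mode: no decoding takes place
        return fobj.read().find(needle)


# Demo on a throwaway file containing one 0xb0 byte (made-up content).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"from 500 to 561 \xb0C\n")
    demo_path = tmp.name

offset = find_byte(demo_path)
print(offset)                      # absolute offset of the 0xb0 byte
print(b"\xb0".decode("latin-1"))   # 0xb0 is the degree sign in Latin-1
os.remove(demo_path)
```

If the byte turns out to be a Latin-1 or cp1252 degree sign, passing an explicit encoding to open() is one possible way to make both machines behave the same; whether that is the right encoding for the real file is an assumption to verify.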