1

I'm trying to read a (large) text file using Python 3.7. I'm simply doing:

with open(filename, 'r') as f:
    # il is the starting line index, defined earlier in the script
    for il, l in enumerate(f, il):
        pass  # do things

This works perfectly if I run the script from Spyder's IPython console on Windows.

However, if I run the exact same script to read the exact same file (not a copy!) from a Unix server, I get the following error:

  File "/net/atgcls01/data2/j02660606/code/freeGSA.py", line 127, in read_gwa
    for il,l in enumerate(f,il):
  File "/u/lon/lamerio/.conda/envs/la3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 2099: invalid start byte

I tried to find the culprit to understand what is going on. I did:

byte_list = []  # avoid shadowing the built-in bytes type
with open(settings['GSA_file'], 'rb') as fobj:
    for i in range(3000):
        b = fobj.read(1)
        byte_list.append((i, b, b.hex()))

byte_list[2095:2105]

The output is:

[(2095, b'0', '30'), (2096, b'0', '30'), (2097, b' ', '20'), (2098, b't', '74'), (2099, b'o', '6f'), (2100, b' ', '20'), (2101, b'5', '35'), (2102, b'6', '36'), (2103, b'1', '31'), (2104, b' ', '20')]

I don't see any 0xb0 byte at position 2099. Indeed, position 2098 is 0x74, position 2099 is 0x6f and position 2100 is 0x20. These translate to the valid UTF-8 characters 't', 'o', ' ' (space), which are indeed at those positions in the file.

How can I solve that error and why does it arise only on the unix machine?

EDIT: Running

import sys
sys.getdefaultencoding()

returns 'utf-8' on both systems.

PS: On Windows I have version 3.7.5, while on Unix I have 3.7.4.

3
  • 1
    Which Python version are you using? Commented Nov 18, 2019 at 11:41
  • 1
    Python 3.7. Thanks for pointing out that I didn't specify it. I've added it to the question. Commented Nov 18, 2019 at 11:41
  • 1
    The reason you don't see the character at 2099 is probably that there are other (valid) multi-byte characters earlier in the file. When the file is read as Unicode, these are interpreted correctly and appear as single characters. When it is read in binary mode, each takes up more than one byte, shifting the invalid byte later in the input. You could run fobj.read().find(b'\xb0') to locate the troublesome byte. Commented Nov 18, 2019 at 11:53
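The search suggested in the last comment can be sketched as follows. This is a minimal illustration, not part of the original thread: the helper name find_byte_offsets and its chunk_size parameter are hypothetical. It reports absolute file offsets, which is useful because the position in a UnicodeDecodeError refers to the chunk of data handed to the decoder and so need not match the offset from the start of the file.

```python
def find_byte_offsets(path, needle=b"\xb0", chunk_size=1 << 20):
    """Return the absolute file offsets of every occurrence of a single byte.

    find_byte_offsets is a hypothetical helper; needle must be one byte long,
    so a match can never straddle two chunks.
    """
    offsets = []
    base = 0  # file offset of the first byte of the current chunk
    with open(path, "rb") as fobj:
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                break
            pos = chunk.find(needle)
            while pos != -1:
                offsets.append(base + pos)
                pos = chunk.find(needle, pos + 1)
            base += len(chunk)
    return offsets
```

Calling find_byte_offsets(settings['GSA_file']) on the file from the question would list where the 0xb0 bytes actually sit; reading in chunks avoids loading a large file into memory at once.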

3 Answers

1

On the unix machine, try

with open(filename, encoding='latin-1') as f: ...

or

with open(filename, encoding='windows-1252') as f: ...

Edit: Windows usually has a different default encoding than Unix. I assume you edited or created the files on your Windows machine. If I recall correctly, you can also open one of those files in Notepad, and it will show you the encoding in the bottom right corner. In any case, that's the encoding you want to specify on your Unix machine. But go ahead and try the two encodings I have suggested.
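To illustrate why these two encodings help, here is a small sketch. The sample bytes are made up for the demonstration and are not taken from the file in the question. The byte 0xb0 maps to the degree sign in both latin-1 and windows-1252, whereas in UTF-8 it can only appear as a continuation byte, never at the start of a character.

```python
raw = b"45\xb0"  # hypothetical sample: two digits followed by the byte 0xb0

# Both single-byte encodings accept 0xb0 and map it to the degree sign.
decoded = {enc: raw.decode(enc) for enc in ("latin-1", "windows-1252")}
for enc, text in decoded.items():
    print(enc, "->", text)

# UTF-8 rejects the same bytes, reproducing the error from the question.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc
    print("utf-8 ->", exc.reason, "at position", exc.start)
```

Note that latin-1 never raises on any input, since every byte value is defined, so a successful read does not by itself prove that latin-1 is the file's true encoding.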


2 Comments

Even though sys.getdefaultencoding() returned 'utf-8', when I changed the function to open(..., encoding='utf-8') the error could be reproduced on Windows as well. Changing the line to encoding='latin-1' solved the issue. It is still not clear to me why, if sys.getdefaultencoding() is 'utf-8', open can read the file when no encoding is specified but raises an error with encoding='utf-8', which should be the default anyway.
Sometimes encodings are weird and could take paragraphs to elaborate on. In my personal experience, trying the encodings mentioned above when such an error occurs works in about 90% of cases. As to why Windows claims to have a default encoding of utf-8, or what exactly is going on under the hood when it seemingly switches encodings, I can't explain either. Glad it works now though!
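A likely explanation for the puzzle raised in these comments: in Python 3.7, open() in text mode without an encoding argument does not use sys.getdefaultencoding(). It falls back to locale.getpreferredencoding(False), which on Windows is typically a code page such as cp1252 and on many Unix systems is UTF-8. The sketch below prints both values so they can be compared on each machine; the variable names are only illustrative.

```python
import locale
import sys

# Encoding used for implicit str <-> bytes conversions; 'utf-8' in Python 3.
internal_default = sys.getdefaultencoding()

# Encoding that open() falls back to in text mode when none is given
# (Python 3.7 behaviour); this depends on the platform and locale settings.
open_default = locale.getpreferredencoding(False)

print("sys.getdefaultencoding():          ", internal_default)
print("locale.getpreferredencoding(False):", open_default)
```

If the second value differs between the two machines, that would account for the same file being readable on one and failing on the other. Passing encoding explicitly to open() removes the dependence on the platform.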
1

The problem may be with the default encoding. On Windows it may not be UTF-8 but some Windows code page. In Poland the default encoding is cp1250, and code like this will work:

with open(filename, 'r', encoding="cp1250") as f:
    for il, l in enumerate(f, il):
        pass  # do things

Comments

0

It's a Unicode character. You can use the "unidecode" module to decode it; it works well.

You can read more about it here: https://pypi.org/project/Unidecode/

3 Comments

But why don't I get the error on Windows? Also, Unicode character 0xb0 is the "°" symbol, which I definitely don't have in my file.
I wouldn't go straight to another package, as that is one more dependency to deal with. The error stems from Unix and Windows having different default encodings. Maybe read up on encodings first if you want to know where the error comes from (which is a great mindset to have).
I faced the same error on Windows and tried to encode it to UTF-8, but that did not work for me, and this package solved my issue.
