UnicodeDecodeError while loading a file in Python

Question

I'm running this:

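from sklearn.datasets import load_mlcomp
from sklearn.feature_extraction.text import TfidfVectorizer

news_train = load_mlcomp('20news-18828', 'train')
vectorizer = TfidfVectorizer(encoding='latin1')
X_train = vectorizer.fit_transform((open(f, errors='ignore').read()
                                    for f in news_train.filenames))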

but it raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 39: invalid continuation byte, at the open() call.

I checked news_train.filenames. It is:

array(['/Users/juby/Downloads/mlcomp/379/train/sci.med/12836-58920',
       ..., '/Users/juby/Downloads/mlcomp/379/train/sci.space/14129-61228'], 
      dtype='<U74')

The paths look correct. It may be related to the dtype or to my environment (I'm on Mac OS X 10.11), but I couldn't fix it after many attempts. Thank you!

P.S. It's an ML tutorial from http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#example-text-mlcomp-sparse-document-classification-py

Yes, it is Python 3.5. I did that, but I got "binary mode doesn't take an errors argument".
– Denly, Commented Jul 29, 2016 at 19:10
Just removing errors='ignore' should do the trick. Or use the answer you posted yourself.
– Philip Tzou, Commented Jul 29, 2016 at 22:08

Denly · Accepted Answer · 2016-07-29 20:07:23Z

1

Well, I found the solution: use

open(f, encoding="latin1")

I'm not sure why it only happens on my Mac, though. I'd like to know.

Alastair McCormack Over a year ago

When you open a file in text mode in Python 3 without an encoding argument, your locale determines which encoding is used to decode it. On Windows, that will be an 8-bit code page similar to latin1 (such as cp1252). On Mac and modern Linux, it's likely to be UTF-8. You should never open a text file without specifying the encoding.
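
The locale-dependent default described in this comment can be checked directly, which may help explain why the same script behaves differently on a Mac and on Windows. This is a minimal sketch using only the standard library; the helper name default_text_encoding is hypothetical and introduced here only for illustration.

```python
import locale


def default_text_encoding():
    """Return the encoding open() falls back to in text mode."""
    # When the encoding argument is omitted, text-mode open() uses the
    # locale's preferred encoding rather than a fixed one.
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    # Typically a UTF-8 variant on macOS and modern Linux, and an
    # 8-bit code page on many Windows setups.
    print(default_text_encoding())
```

If this prints a UTF-8 variant, any file containing non-UTF-8 bytes will fail to decode unless an encoding is passed to open() explicitly.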

Philip Tzou · Accepted Answer · 2016-07-29 22:16:24Z

0

Actually, in Python 3 the open function opens files in the default mode 'r', which decodes the file content (on most platforms, as UTF-8). Since your files are encoded in latin1, decoding them as UTF-8 can raise UnicodeDecodeError. The solution is either to open the files in binary mode ('rb') or to specify the correct encoding (encoding="latin1").

open(f, 'rb').read()  # returns `bytes` rather than `str`
# or,
open(f, encoding='latin1').read()  # returns latin1 decoded `str`
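
To see the three behaviours side by side, here is a small self-contained sketch. It writes a temporary file containing byte 0xe4 (the byte named in the traceback, which is 'ä' in latin1) and reads it back as UTF-8, as latin1, and in binary mode. The sample text and the helper name read_three_ways are made up for illustration and do not come from the 20news dataset.

```python
import os
import tempfile


def read_three_ways(raw):
    """Write raw bytes to a temp file; read as UTF-8, latin1, and binary."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(raw)

        # 1. Decoding as UTF-8 fails on bytes that are not valid UTF-8.
        try:
            with open(path, encoding="utf-8") as fh:
                utf8_result = fh.read()
        except UnicodeDecodeError as exc:
            utf8_result = exc

        # 2. latin1 maps every byte to a code point, so decoding never fails.
        with open(path, encoding="latin1") as fh:
            latin1_text = fh.read()

        # 3. Binary mode skips decoding entirely and returns bytes.
        with open(path, "rb") as fh:
            raw_bytes = fh.read()
    finally:
        os.remove(path)
    return utf8_result, latin1_text, raw_bytes


if __name__ == "__main__":
    # Hypothetical sample: "Universität" encoded as latin1.
    utf8_result, latin1_text, raw_bytes = read_three_ways(b"Universit\xe4t")
    print(repr(utf8_result))
    print(latin1_text)
    print(raw_bytes)
```

Note that TfidfVectorizer's encoding argument is only used when the vectorizer is given bytes or files to decode. If the generator yields str that open() has already decoded, encoding='latin1' on the vectorizer has no effect, which is why the binary-mode variant pairs naturally with it.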
