I'm running this:

news_train = load_mlcomp('20news-18828', 'train')
vectorizer = TfidfVectorizer(encoding='latin1')
X_train = vectorizer.fit_transform((open(f, errors='ignore').read()
                                for f in news_train.filenames))

but it raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 39: invalid continuation byte, at the open() call.

I checked news_train.filenames. It contains:

array(['/Users/juby/Downloads/mlcomp/379/train/sci.med/12836-58920',
       ..., '/Users/juby/Downloads/mlcomp/379/train/sci.space/14129-61228'], 
      dtype='<U74')

The paths look correct. It may be related to the dtype or to my environment (I'm on Mac OS X 10.11), but I haven't been able to fix it after many attempts. Thank you!

P.S. This is from the scikit-learn ML tutorial at http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#example-text-mlcomp-sparse-document-classification-py

  • Python 3? Try open(f, mode='rb', errors='ignore'). Commented Jul 29, 2016 at 17:56
  • Yes, it is Python 3.5. I tried that, but I got "binary mode doesn't take an errors argument". Commented Jul 29, 2016 at 19:10
  • Just removing errors='ignore' should do the trick. Or use the answer you posted yourself. Commented Jul 29, 2016 at 22:08

2 Answers

Well, I found the solution: use

open(f, encoding = "latin1")

I'm not sure why it only happens on my Mac, though. I'd like to know the reason.

1 Comment

When you open a file in text mode in Python 3 without an encoding argument, your locale determines which encoding is used to decode the file. On Windows, that will typically be an 8-bit code page similar to latin1 (for example cp1252). On Mac and modern Linux, it's likely to be UTF-8. You should never open a text file without specifying the encoding.
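To see which encoding your own machine falls back to, a minimal sketch like the following may help. It assumes the standard library's locale.getpreferredencoding is what open() consults in text mode when no encoding is given, which is the documented behavior for Python 3 outside of UTF-8 mode; the variable name default_encoding is only illustrative.

```python
import locale

# Encoding that open() falls back to in text mode when `encoding` is omitted
# (assumption: Python's UTF-8 mode is not enabled).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If this prints UTF-8 on the Mac and a code page name on a Windows machine, that would explain why the same script fails on one and not the other.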

In Python 3, open() defaults to text mode ('r'), which decodes the file content using the platform's default encoding (UTF-8 on most platforms). Since your files are encoded in latin1, decoding them as UTF-8 can raise UnicodeDecodeError. The solution is either to open the files in binary mode ('rb') or to specify the correct encoding (encoding="latin1").

open(f, 'rb').read()  # returns `bytes` rather than `str`
# or,
open(f, encoding='latin1').read()  # returns latin1 decoded `str`
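As a rough, self-contained illustration of the difference, the sketch below writes a small file containing the byte 0xe4 from the error message and reads it back three ways. The file name sample.txt and its contents are made up for the demonstration and are not taken from the 20news dataset.

```python
import os
import tempfile

# Hypothetical sample: 0xe4 is "ä" in latin1 but an incomplete sequence in UTF-8.
raw = b"K\xe4se costs 3 dollars\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as fh:
        fh.write(raw)

    # Decoding as UTF-8 fails, just like the default on the asker's Mac.
    try:
        with open(path, encoding="utf-8") as fh:
            fh.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Naming the real encoding yields a decoded str.
    with open(path, encoding="latin1") as fh:
        text = fh.read()

    # Binary mode skips decoding entirely and yields bytes.
    with open(path, "rb") as fh:
        data = fh.read()

print(utf8_failed, repr(text), repr(data))
```

Using with blocks also closes each file promptly, which the one-liners above leave to the garbage collector.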
