I'm running this:
news_train = load_mlcomp('20news-18828', 'train')
vectorizer = TfidfVectorizer(encoding='latin1')
X_train = vectorizer.fit_transform((open(f, errors='ignore').read()
                                    for f in news_train.filenames))
but the open() call raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 39: invalid continuation byte.
I checked news_train.filenames. It is:
array(['/Users/juby/Downloads/mlcomp/379/train/sci.med/12836-58920',
..., '/Users/juby/Downloads/mlcomp/379/train/sci.space/14129-61228'],
dtype='<U74')
The paths look correct. The problem may be related to the dtype or to my environment (I'm on Mac OS X 10.11), but I haven't been able to fix it after many attempts. Thank you!
P.S. The code is from an ML tutorial: http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#example-text-mlcomp-sparse-document-classification-py
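Open the files in binary mode: open(f, mode='rb'). Leave out errors='ignore', because binary mode does not accept an errors argument and raises ValueError if one is given. The bytes are then decoded by TfidfVectorizer(encoding='latin1') instead of by open().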
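To see why reading bytes avoids the error, here is a minimal sketch using only the standard library. The temporary file and its contents are made up for illustration; it only stands in for a newsgroup post that contains the byte 0xe4, and it is not the tutorial's data.

```python
import os
import tempfile

# Hypothetical sample file containing a Latin-1 encoded "ä" (byte 0xe4),
# standing in for one of the newsgroup posts.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Subject: M\xe4rz\n")
    path = tmp.name

# Text mode with the UTF-8 codec fails on the 0xe4 byte.
try:
    with open(path, encoding="utf-8") as fh:
        fh.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Binary mode returns raw bytes; nothing is decoded at read time.
with open(path, mode="rb") as fh:
    raw = fh.read()

# Latin-1 maps every byte to a code point, so this decode cannot fail.
text = raw.decode("latin1")

# Alternatively, decode while reading in text mode.
with open(path, encoding="latin1") as fh:
    text_direct = fh.read()

os.remove(path)
print(utf8_failed, text == text_direct, repr(text))
```

Applied to the code in the question, this would presumably mean passing open(f, mode='rb').read() (or open(f, encoding='latin1').read()) for each f in news_train.filenames to vectorizer.fit_transform.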