
I am extracting features from a text corpus using scikit-learn's tf-idf vectorizer and truncated singular value decomposition. However, the algorithm I want to try out requires dense matrices, while the vectorizer returns sparse ones, so I need to convert those matrices to dense arrays. But whenever I try to convert them I get an error telling me that my numpy array object has no attribute "toarray". What am I doing wrong?

The function:

def feature_extraction(train,train_test,test_set):
    vectorizer = TfidfVectorizer(min_df=3, strip_accents="unicode", analyzer="word", token_pattern=r'\w{1,}', ngram_range=(1, 2))

    print("fitting Vectorizer")
    vectorizer.fit(train)

    print("transforming text")
    train = vectorizer.transform(train)
    train_test = vectorizer.transform(train_test)
    test_set = vectorizer.transform(test_set)

    print("Dimensionality reduction")
    svd = TruncatedSVD(n_components=100)
    svd.fit(train)
    train = svd.transform(train)
    train_test = svd.transform(train_test)
    test_set = svd.transform(test_set)

    print("convert to dense array")
    train = train.toarray()
    test_set = test_set.toarray()
    train_test = train_test.toarray()

    print(train.shape)
    return train,train_test,test_set

The traceback:

Traceback (most recent call last):
  File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 24, in <module>
    x_train,x_test,test_set = feature_extraction(x_train,x_test,test_set)
  File "C:\Users\Anonymous\workspace\final_submission\src\Preprocessing.py", line 57, in feature_extraction
    train = train.toarray()
AttributeError: 'numpy.ndarray' object has no attribute 'toarray'
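
(Side note for anyone debugging something similar: a quick way to see what kind of object transform actually returned is something like the snippet below; train here is just the value coming out of svd.transform.)

from scipy import sparse

print(type(train))             # <type 'numpy.ndarray'>
print(sparse.issparse(train))  # False for a dense ndarray, True for a scipy sparse matrix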

Update: Willy pointed out that my assumption that the matrix is sparse might be wrong. So I tried feeding my data to my algorithm with dimensionality reduction and it actually worked without any conversion. However, when I exclude dimensionality reduction, which leaves me with around 53k features, I get the following error:

Traceback (most recent call last):
  File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 28, in <module>
    result = bayesian_ridge(x_train,x_test,y_train,y_test,test_set)
  File "C:\Users\Anonymous\workspace\final_submission\src\Algorithms.py", line 84, in bayesian_ridge
    algo = algo.fit(x_train,y_train[:,i])
  File "C:\Python27\lib\site-packages\sklearn\linear_model\bayes.py", line 136, in fit
    dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 220, in check_arrays
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Can someone explain this?

Update 2

As requested, I'll give all the code involved. Since it is scattered over different files I'll just post it in steps. For clarity I'll leave all the module imports out.

This is how I preprocess my data:

def regexp(data):
    # replace any run of non-word characters with a single space
    for row in range(len(data)):
        data[row] = re.sub(r'[\W_]+', " ", data[row])
    return data

def clean_the_text(data):
    # tokenize, lowercase, strip newlines, then re-join into one string
    alist = []
    data = nltk.word_tokenize(data)
    for j in data:
        j = j.lower()
        alist.append(j.rstrip('\n'))
    alist = " ".join(alist)
    return alist

def loop_data(data):
    for i in range(len(data)):
        data[i] = clean_the_text(data[i])
    return data


if __name__ == "__main__":
    print("loading train")
    train_text = porter_stemmer(loop_data(regexp(list(np.array(p.read_csv(os.path.join(dir,"train.csv")))[:,1]))))
    print("loading test_set")
    test_set = porter_stemmer(loop_data(regexp(list(np.array(p.read_csv(os.path.join(dir,"test.csv")))[:,1]))))

After splitting my train set into x_train and x_test for cross-validation, I transform my data using the feature_extraction function above.

x_train,x_test,test_set = feature_extraction(x_train,x_test,test_set)

Finally, I feed them into my algorithm:

def bayesian_ridge(x_train,x_test,y_train,y_test,test_set):
    result = []
    algo = linear_model.BayesianRidge()
    algo = algo.fit(x_train,y_train)
    pred = algo.predict(x_test)
    error = pred - y_test
    result.append(algo.predict(test_set))
    print("Bayes_error: ",cross_val(error))
    return result
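
(A minimal sketch of a guard that could go in front of the fit call, assuming the inputs may be either sparse or dense depending on whether the SVD step ran; densify is just an illustrative helper name, and with around 53k features the dense copy can get very large:)

from scipy import sparse

def densify(X):
    # return a dense ndarray whether X is a scipy sparse matrix or already dense
    return X.toarray() if sparse.issparse(X) else X

x_train = densify(x_train)
x_test = densify(x_test)
test_set = densify(test_set)
result = bayesian_ridge(x_train, x_test, y_train, y_test, test_set)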
  • If train is already an ndarray, then your assumption about it returning a sparse matrix is incorrect. Commented Nov 22, 2013 at 18:26
  • You might be right, let me check that. Commented Nov 22, 2013 at 18:28
  • Checked it. Going to add an edit to my question right now. Commented Nov 22, 2013 at 18:33
  • you should include all the code, not just messages. ndarray is dense by definition, sparse matrices are represented in different objects, so there is rather an error in your code (which you did not attach) Commented Nov 22, 2013 at 19:50
  • Ok, I'll add all the code involved. Commented Nov 22, 2013 at 20:22

1 Answer


TruncatedSVD.transform returns an array, not a sparse matrix. In fact, in the present version of scikit-learn, only the vectorizers return sparse matrices.
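
Roughly speaking, that means a conversion is only needed when you skip the SVD step. A small sketch to illustrate, reusing the vectorizer and svd objects from the question (the variable names are just placeholders):

from scipy import sparse

X_tfidf = vectorizer.transform(train)   # scipy sparse matrix (CSR)
print(sparse.issparse(X_tfidf))         # True

X_reduced = svd.transform(X_tfidf)      # plain numpy ndarray
print(sparse.issparse(X_reduced))       # False -- calling .toarray() on it raises AttributeError

# Only the sparse tf-idf matrix needs an explicit conversion, e.g. before an
# estimator such as BayesianRidge that requires dense input:
X_dense = X_tfidf.toarray()

So in the posted feature_extraction the three toarray() calls after the SVD step can simply be dropped, while the pipeline without SVD needs the conversion (or an estimator that accepts sparse input).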


1 Comment

@Learner: it's in the docstring for that method.
