3

I want to run sklearn's RandomForestClassifier on some data that is packed as a numpy.ndarray which happens to be sparse. Calling fit gives ValueError: setting an array element with a sequence. From other posts I understand that random forest cannot handle sparse data.

I expected the object to have a todense method, but it doesn't.

>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)
>>> type(X_train)
<class 'numpy.ndarray'>

I tried wrapping it with a SciPy csr_matrix but that gives errors as well.

Is there any way to make random forest accept this data? (not sure that dense would actually fit in memory, but that's another thing...)

EDIT 1

The code generating the error is just this:

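import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X_train = np.load('train.npy')  # this returns an ndarray
train_gt = pd.read_csv('train_gt.csv')

model = RandomForestClassifier()
model.fit(X_train, train_gt.target)  # raises the ValueError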

As for the suggestion to use toarray(), ndarray does not have such a method: AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

Moreover, as mentioned, for this particular data I would need terabytes of memory to hold the array. Is there an option to run RandomForestClassifier with a sparse array?

EDIT 2

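It seems that the data should have been saved with SciPy's sparse save/load functions, as described in "Save / load scipy sparse csr_matrix in portable data format". To use NumPy's save/load instead, the matrix's components (such as data, indices, indptr and shape) would have had to be saved explicitly, not just the matrix object.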

    Please provide a minimal reproducible example so that we can see which code causes the error. Right now, you're only showing the type of X_train, but neither its shape nor how you are feeding it into the RandomForestClassifier. Likely, the data is not shaped correctly, see the answer to this related question. Commented Apr 11, 2019 at 16:52

4 Answers

8
>>> X_train
array(<1443899x1936774 sparse matrix of type '<class 'numpy.float64'>'
    with 141256894 stored elements in Compressed Sparse Row format>,
      dtype=object)

means that your code, or something it calls, has done np.array(M), where M is a CSR sparse matrix. That does not convert the matrix; it just wraps it in a 0-d object-dtype array.
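The wrapping can be reproduced on a small matrix. This is a minimal sketch; the 3x3 identity matrix is a hypothetical stand-in for the real data, and the names `M` and `wrapped` are only for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical small stand-in for the real 1443899 x 1936774 matrix.
M = sparse.csr_matrix(np.eye(3))

# np.array does not densify a SciPy sparse matrix; it stores the matrix
# object itself as the single element of a 0-d object array.
wrapped = np.array(M)

print(type(wrapped))   # numpy.ndarray, as in the question
print(wrapped.shape)   # empty shape: a 0-d array
print(wrapped.dtype)   # object
```

Because the outer object is a plain ndarray, sparse-only methods such as toarray are not available on it, which matches the AttributeError in the question.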

To use a sparse matrix in code that doesn't take sparse matrices, you first have to convert it to dense:

 arr = M.toarray()    # or M.A same thing
 mat = M.todense()    # to make a np.matrix

But given the dimensions and number of nonzero elements, it is likely that this conversion will produce a memory error.


2 Comments

The object I get is an ndarray, which does not have toarray or todense. I cannot see any method that would convert it to a csr_matrix.
Use X_train[()] to take the wrongly saved matrix out of the array wrapper. Then use toarray.
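A minimal sketch of the unwrapping step from the comment above, assuming the loaded object is a 0-d object array holding one sparse matrix. The tiny 2x2 matrix and the names `wrapped`, `inner` and `same` are hypothetical; the wrapper is built by hand here so the example does not depend on how the file was saved.

```python
import numpy as np
from scipy import sparse

# Hypothetical tiny matrix standing in for the real data.
M = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))

# Build a 0-d object array that holds M, mimicking what np.load returned.
holder = np.empty(1, dtype=object)
holder[0] = M
wrapped = holder.reshape(())

# Indexing a 0-d array with an empty tuple returns the stored object.
inner = wrapped[()]
# .item() is an equivalent, slightly more explicit spelling.
same = wrapped.item()

print(type(inner))
print(inner.shape)
```

After this step `inner` is the original sparse matrix again, so it can be passed on as sparse data without calling toarray.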
1

I believe you're looking for the toarray method, as shown in the documentation.

So you can do, e.g., X_dense = X_train.toarray().

Of course, then your computer crashes (unless you have the requisite 22 terabytes of RAM?).
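The memory figure can be checked with plain arithmetic from the shape and element count shown in the question. This is a rough sketch: it assumes 8 bytes per float64 value and, for the sparse estimate, 4-byte (int32) index arrays, which is an assumption about how the CSR matrix is stored.

```python
# Numbers taken from the repr of X_train in the question.
rows, cols = 1_443_899, 1_936_774
stored_elements = 141_256_894

bytes_per_float64 = 8
bytes_per_index = 4  # assumption: int32 index arrays

# Dense storage needs one float64 per cell.
dense_bytes = rows * cols * bytes_per_float64

# CSR storage: values, one column index per value, one row pointer per row (+1).
csr_bytes = (
    stored_elements * bytes_per_float64
    + stored_elements * bytes_per_index
    + (rows + 1) * bytes_per_index
)

print(f"dense: {dense_bytes / 1e12:.1f} TB")
print(f"CSR:   {csr_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense array comes out at roughly 22 TB, while the sparse representation stays under 2 GB, so keeping the data sparse is the only practical option.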

1 Comment

ndarray does not have a toarray method, otherwise I would not have asked the question. And you are correct, the array would require terabytes of memory, which is not practical.
1

Since you've loaded a csr matrix using np.load, you need to convert it from an np array back to a csr matrix. You said you tried wrapping it with csr_matrix, but the array itself is not the matrix; the matrix is the single object stored inside it. You need to pull it out first, for example by calling .all() on the array:

temp = csr_matrix(X_train.all())
X_train = temp.toarray()

0

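It seems that the data should have been saved with SciPy's sparse save/load functions, as described in "Save / load scipy sparse csr_matrix in portable data format". To use NumPy's save/load instead, the matrix's components (such as data, indices, indptr and shape) would have had to be saved explicitly, not just the matrix object.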

RandomForestClassifier can run using data in this format. The code has been running for 1:30h now, so hopefully it will actually finish :-)
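A minimal sketch of that workflow, assuming scipy.sparse.save_npz and load_npz are used for the round trip. The toy data, the file name train.npz and the small n_estimators value are hypothetical choices to keep the example fast; they are not taken from the original code.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: a mostly-zero 200 x 50 matrix and binary labels.
rng = np.random.default_rng(0)
dense = rng.random((200, 50))
dense[dense < 0.95] = 0.0
X = sparse.csr_matrix(dense)
y = np.arange(200) % 2

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.npz")
    sparse.save_npz(path, X)           # stores the CSR components in one .npz file
    X_loaded = sparse.load_npz(path)   # comes back as a sparse matrix, not an ndarray

# The classifier is fitted on the sparse matrix directly; no toarray call.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_loaded, y)
preds = model.predict(X_loaded)
print(preds.shape)
```

Saving this way avoids the object-array wrapper entirely, so nothing has to be unwrapped or densified after loading.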
