
I have built a neural network and it works fine with a smaller dataset, like 300,000 known-good rows and 70,000 suspicious rows. When I increased the known-good set to 6.5 million rows I ran into memory errors, so I decided to try a Pipeline and run the dataframe through it. I have two categorical variables and one column for the dependent variable of 1's and 0's. To start off, the dataset looks like this:

DBF2
   ParentProcess                   ChildProcess               Suspicious
0  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
1  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
2  C:\Windows\System32\svchost.exe                      ...            1
3  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
4  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
5  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0

This is what worked, but when my array grew too big it exceeded memory:

X = DBF2.iloc[:, 0:2].values
y = DBF2.iloc[:, 2].values
#Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#Label Encode destUserName
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
#Label Encode Parent Process
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])

#Create dummy variables
onehotencoder = OneHotEncoder(categorical_features = [0,1])
X = onehotencoder.fit_transform(X).toarray()

And I get this memory error when the huge sparse matrix is converted to a dense array:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 947, in toarray
    out = self._process_toarray_args(order, out)
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 1184, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
 MemoryError

So I did some research and found that you can use a Pipeline (How to perform OneHotEncoding in Sklearn, getting value error), and tried to implement that:

2nd Edit

>>> from sklearn.preprocessing import LabelEncoder, OneHotEncoder
>>> labelencoder_X_1 = LabelEncoder()
>>> X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
>>> labelencoder_X_2 = LabelEncoder()
>>> X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])

>>> onehotencoder = OneHotEncoder(categorical_features = [0,1])
>>> X = onehotencoder.fit_transform(X)

>>> X
<7026504x7045 sparse matrix of type '<type 'numpy.float64'>'
    with 14053008 stored elements in Compressed Sparse Row format>

#Avoid the dummy variable trap by deleting 1 from each categorical variable
>>> X = np.delete(X, [2038], axis=1)
>>> X = np.delete(X, [0], axis=1)

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


#ERROR
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/sklearn/model_selection/_split.py", line 2031, in train_test_split
    arrays = indexable(*arrays)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 229, in indexable
check_consistent_length(*result)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 200, in check_consistent_length
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 119, in _num_samples
" a valid collection." % x)
TypeError: Singleton array array(<7026504x7045 sparse matrix of type '<type 'numpy.float64'>'
with 14053008 stored elements in Compressed Sparse Row format>,
  dtype=object) cannot be considered a valid collection.

>>> from sklearn.preprocessing import StandardScaler
>>> sc = StandardScaler()
>>> X_train = sc.fit_transform(X_train)
#ERROR

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'X_train' is not defined

>>> X_test = sc.transform(X_test)
#ERROR

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'X_test' is not defined

1 Answer


Why are you calling toarray() on the output of OneHotEncoder in the first place? Most scikit-learn estimators can handle sparse matrices directly. Your pipeline code is doing exactly the same thing you were doing before the memory error.
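To illustrate the point (with a toy matrix, since the real encoded data isn't shown): a scipy sparse matrix can go straight into train_test_split, and if you want to scale it afterwards, StandardScaler needs with_mean=False because centering would densify the matrix. This is a minimal sketch, not the asker's actual data; the example uses Python 3 syntax.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy one-hot-style sparse matrix: 10 samples, 5 columns.
X = sparse.csr_matrix(np.eye(5)[np.arange(10) % 5])
y = np.arange(10) % 2

# train_test_split accepts scipy sparse matrices directly; no toarray() needed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# StandardScaler cannot center sparse data without densifying it,
# so with_mean=False scales by the standard deviation only.
sc = StandardScaler(with_mean=False)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

print(X_train.shape, sparse.issparse(X_train))
```

The whole pipeline stays sparse end to end, which is what avoids the MemoryError.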

First, you have done this:

X = DBF2.iloc[:, 0:2].values

Here, DBF2 is a pandas DataFrame, which has a values attribute to get the underlying numpy array.

So now X is a numpy array, and you cannot call X.values on it anymore. That is the reason for your first error, which you have now corrected.

Now, about the warning: it is not related to X but to y. (It is just a warning and nothing to worry about.) You did this:

y = DBF2.iloc[:, 2].values

So, y is a numpy array of shape (n_samples, 1); the 1 is because you selected only a single column. But most scikit-learn estimators require y of shape (n_samples, ). Note the empty value after the comma.

So you need to do this:

y = DBF2.iloc[:, 2].values.ravel()
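A minimal illustration of the shape difference that ravel() fixes (the array here is made up; only the shapes matter):

```python
import numpy as np

# A column-shaped target of shape (n_samples, 1) ...
y_2d = np.array([[0], [1], [1], [0]])

# ... flattened by ravel() into the (n_samples,) shape
# that scikit-learn estimators expect for y.
y_1d = y_2d.ravel()

print(y_2d.shape, y_1d.shape)
```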

Update:

X is a sparse matrix, so you cannot use numpy operations such as np.delete on it. Do this instead:

index_to_drop = [0, 2038]      #<=== Just add all the columns to drop here
to_keep = list(set(xrange(X.shape[1]))-set(index_to_drop))    
X = X[:,to_keep]

# Your other code here
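Here is the same column-keep trick on a self-contained toy matrix (the indices are illustrative, not the asker's real ones; range replaces xrange on Python 3):

```python
import numpy as np
from scipy import sparse

# 3x4 toy sparse matrix.
X = sparse.csr_matrix(np.arange(12).reshape(3, 4))

# Drop columns 0 and 2 by indexing with the complement of the drop set.
index_to_drop = [0, 2]
to_keep = sorted(set(range(X.shape[1])) - set(index_to_drop))
X = X[:, to_keep]

print(X.shape)
print(X.toarray())
```

Fancy indexing with a list of column indices keeps the result sparse, so memory usage stays low.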

2 Comments

Thank you for the feedback and info! I seem to be running into another error when I run my new code. I added it to my question @Vivek Kumar
@sectechguy That's due to an error in your SingleColumnSelector; you need to use iloc there. But again, why are you doing all this pipeline code? Your code before the memory error was fine. Just explain why you want to do .toarray(). You can just do X = onehotencoder.fit_transform(X) and proceed further.
