
I have built a neural network and it works fine with a smaller dataset, like 300,000 known-good rows and 70,000 suspicious rows. When I increased the known-good set to 6.5 million rows I ran into memory errors, so I decided to try a Pipeline and run the dataframe through it. I have two categorical variables and one column for the dependent variable of 1's and 0's. To start off, the dataset looks like this:

DBF2
   ParentProcess                   ChildProcess               Suspicious
0  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
1  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
2  C:\Windows\System32\svchost.exe                      ...            1
3  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
4  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
5  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0

This is what worked, but when my array grew too big it exceeded memory:

X = DBF2.iloc[:, 0:2].values
y = DBF2.iloc[:, 2].values
#Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#Label Encode destUserName
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
#Label Encode Parent Process
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])

#Create dummy variables
onehotencoder = OneHotEncoder(categorical_features = [0,1])
X = onehotencoder.fit_transform(X).toarray()

And I get this memory error when the huge sparse matrix is converted to a dense array:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 947, in toarray
    out = self._process_toarray_args(order, out)
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 1184, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
 MemoryError

So I did some research and found that you can use a Pipeline (How to perform OneHotEncoding in Sklearn, getting value error), and tried to implement that:

2nd Edit

>>> from sklearn.preprocessing import LabelEncoder, OneHotEncoder
>>> labelencoder_X_1 = LabelEncoder()
>>> X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
>>> labelencoder_X_2 = LabelEncoder()
>>> X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])

>>> onehotencoder = OneHotEncoder(categorical_features = [0,1])
>>> X = onehotencoder.fit_transform(X)

>>> X
<7026504x7045 sparse matrix of type '<type 'numpy.float64'>'
    with 14053008 stored elements in Compressed Sparse Row format>

#Avoid the dummy variable trap by deleting 1 from each categorical variable
>>> X = np.delete(X, [2038], axis=1)
>>> X = np.delete(X, [0], axis=1)

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


#ERROR
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/sklearn/model_selection/_split.py", line 2031, in train_test_split
    arrays = indexable(*arrays)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 229, in indexable
check_consistent_length(*result)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 200, in check_consistent_length
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 119, in _num_samples
" a valid collection." % x)
TypeError: Singleton array array(<7026504x7045 sparse matrix of type '<type 'numpy.float64'>'
with 14053008 stored elements in Compressed Sparse Row format>,
  dtype=object) cannot be considered a valid collection.

>>> from sklearn.preprocessing import StandardScaler
>>> sc = StandardScaler()
>>> X_train = sc.fit_transform(X_train)
#ERROR

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'X_train' is not defined

>>> X_test = sc.transform(X_test)
#ERROR

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'X_test' is not defined

1 Answer


Why are you calling toarray() on the output of OneHotEncoder in the first place? Most scikit-learn estimators can handle sparse matrices directly. Your pipeline code is doing exactly the same thing you were doing before the memory error.
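To illustrate the point (with a toy matrix, since the real encoded data isn't shown): a scipy sparse matrix can go straight into train_test_split, and if you want to scale it afterwards, StandardScaler needs with_mean=False because centering would densify the matrix. This is a minimal sketch, not the asker's actual data; the example uses Python 3 syntax.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy one-hot-style sparse matrix: 10 samples, 5 columns.
X = sparse.csr_matrix(np.eye(5)[np.arange(10) % 5])
y = np.arange(10) % 2

# train_test_split accepts scipy sparse matrices directly; no toarray() needed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# StandardScaler cannot center sparse data without densifying it,
# so with_mean=False scales by the standard deviation only.
sc = StandardScaler(with_mean=False)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

print(X_train.shape, sparse.issparse(X_train))
```

The whole pipeline stays sparse end to end, which is what avoids the MemoryError.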

First, you have done this:

X = DBF2.iloc[:, 0:2].values

Here, DBF2 is a pandas DataFrame, which has a values attribute to get the underlying numpy array.

So now X is a numpy array, and you cannot call X.values on it anymore. That is the reason for your first error, which you have now corrected.

Now, about the warning: it is not related to X but to y. (It is just a warning and nothing to worry about.) You did this:

y = DBF2.iloc[:, 2].values

So, y is a numpy array of shape (n_samples, 1); the 1 is because you selected only a single column. But most scikit-learn estimators require y of shape (n_samples, ). Note the empty value after the comma.

So you need to do this:

y = DBF2.iloc[:, 2].values.ravel()
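A minimal illustration of the shape difference that ravel() fixes (the array here is made up; only the shapes matter):

```python
import numpy as np

# A column-shaped target of shape (n_samples, 1) ...
y_2d = np.array([[0], [1], [1], [0]])

# ... flattened by ravel() into the (n_samples,) shape
# that scikit-learn estimators expect for y.
y_1d = y_2d.ravel()

print(y_2d.shape, y_1d.shape)
```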

Update:

X is a sparse matrix, so you cannot use numpy operations such as np.delete on it. Do this instead:

index_to_drop = [0, 2038]      #<=== Just add all the columns to drop here
to_keep = list(set(xrange(X.shape[1]))-set(index_to_drop))    
X = X[:,to_keep]

# Your other code here
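Here is the same column-keep trick on a self-contained toy matrix (the indices are illustrative, not the asker's real ones; range replaces xrange on Python 3):

```python
import numpy as np
from scipy import sparse

# 3x4 toy sparse matrix.
X = sparse.csr_matrix(np.arange(12).reshape(3, 4))

# Drop columns 0 and 2 by indexing with the complement of the drop set.
index_to_drop = [0, 2]
to_keep = sorted(set(range(X.shape[1])) - set(index_to_drop))
X = X[:, to_keep]

print(X.shape)
print(X.toarray())
```

Fancy indexing with a list of column indices keeps the result sparse, so memory usage stays low.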

2 Comments

Thank you for the feedback and info! I seem to be running into another error when I run my new code. I added it to my question @Vivek Kumar
@sectechguy That's due to an error in your SingleColumnSelector; you need to use iloc there. But again, why are you doing all this pipeline code? Your code before the memory error was fine. Just explain why you want to do .toarray(). You can just do X = onehotencoder.fit_transform(X) and proceed further.
