How to use NumPy array in Scikit-learn

Question

For a machine learning project, I built a pandas DataFrame to use as input to scikit-learn:

  label                                             vector
0      0   1:0.02776011 2:-0.009072121 3:0.05915284 4:-0...
1      1   1:0.014463682 2:-0.00076486735 3:0.044999316 ...
2      1   1:0.010583069 2:-0.0072133583 3:0.03766079 4:...
3      0   1:0.02776011 2:-0.009072121 3:0.05915284 4:-0...
4      1   1:0.039645035 2:-0.039485127 3:0.0898234 4:-0...
..   ...                                                ...
95     0   1:-0.013014212 2:-0.008092734 3:0.050860845 4...
96     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
97     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
98     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
99     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...

Here, label is the class label of each record and vector is the feature vector of that record.

To pass the data to scikit-learn, I create two separate arrays: one for the label column (y) and one for the vector column (X).

As suggested here, I create the X array like this:

X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)

This works, and the output is:

               1               2            3  ...            298            299           300
0     0.02776011    -0.009072121   0.05915284  ...  0.00035095372    -0.01569933  -0.010564591
1    0.014463682  -0.00076486735  0.044999316  ...   -0.008144852  -0.0066369134  -0.013060478
2    0.010583069   -0.0072133583   0.03766079  ...   0.0041615684    0.008569179  -0.008645372
3     0.02776011    -0.009072121   0.05915284  ...  0.00035095372    -0.01569933  -0.010564591
4    0.039645035    -0.039485127    0.0898234  ...   0.0046293125     0.01663368   0.010215017
..           ...             ...          ...  ...            ...            ...           ...
95  -0.013014212    -0.008092734  0.050860845  ...   0.0021799654   -0.011884902   0.016460473
96  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094
97  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094
98  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094
99  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094

[100 rows x 300 columns]

The 100 rows are my records and the 300 columns are the features of each vector.
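One detail worth noting about the snippet above: astype() returns a new object, so print(X.astype(float).to_numpy()) only prints the converted array and X itself still holds strings. A minimal sketch of keeping the converted result, assuming a small made-up DataFrame in place of the real TSV data (the three rows below are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the question's df: "index:value" pairs separated by spaces.
df = pd.DataFrame({
    "label": ["0", "1", "1"],
    "vector": [
        "1:0.5 2:-0.25 3:0.125",
        "1:0.1 2:0.2 3:0.3",
        "1:-1.0 2:0.0 3:2.0",
    ],
})

# Same parsing as in the question: one dict per row, keyed by the feature index.
parsed = pd.DataFrame(
    [dict(pair.split(":") for pair in row.split()) for row in df["vector"]]
)

# Keep the converted array instead of only printing it.
X = parsed.astype(float).to_numpy()

print(X.shape)  # (3, 3)
```

With the real data this would presumably give a (100, 300) float array that can be passed to train_test_split and SVC directly.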

To create the y array, as suggested here, I do this instead:

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(y)

The output is:

[100 rows x 2 columns]
[[0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [...]
]

I get a NumPy array with the 100 records, but the output reports 2 columns instead of 1.

I think this is the cause of the following warning. Is that right?

/Users/mac-pro/scikit_learn/lib/python3.7/site-packages/sklearn/utils/validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

If so, how can I get an output like the one I got for the X array?

In case it helps, here is the full code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from scipy.stats import sem


r_filenameTSV = 'TSV/A19784_test3886.tsv'

tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])

df = pd.DataFrame(tsv_read)

df = pd.DataFrame(df.vector.str.split(' ',1).tolist(),
                                   columns = ['label','vector'])

print(df)


y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(y)
#exit()

X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)
#exit()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)


clf = svm.SVC(kernel='rbf',
              C=100,
              gamma=0.001,
              )
scores = cross_val_score(clf, X, y, cv=10)

print ("K-Folds scores:")
print (scores) 

#Train the model using the training sets
clf.fit (X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

print ("Metrics and Scoring:")
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("F1:",metrics.f1_score(y_test, y_pred))

print ("Classification Report:")
print (metrics.classification_report(y_test, y_pred,labels=[0,1]))

Thanks again for your time.

lalfab · Accepted Answer · 2020-04-20 10:17:38Z

As the warning says, you just need to change the shape of your y array.

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

So you have two options; either of the following lines will solve it.

Option 1:

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1,1).ravel()
print(y.shape)
# Output
(8,)

Option 2:

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1,)
print(y.shape)
# Output
(8,)
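To see what the two options change, here is a small sketch that compares the shapes before and after flattening. The four-row DataFrame is a hypothetical stand-in for the question's data, so the printed sizes differ from the ones above:

```python
import pandas as pd

# Hypothetical 4-row stand-in; labels are strings, as they are after str.split().
df = pd.DataFrame({"label": ["0", "1", "1", "0"]})

# The question's expression: a column vector of shape (n_samples, 1).
column = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(column.shape)        # (4, 1)

# Option 1: flatten with ravel().
flat_ravel = column.ravel()
print(flat_ravel.shape)    # (4,)

# Option 2: reshape straight to one dimension.
flat_reshape = column.reshape(-1)
print(flat_reshape.shape)  # (4,)
```

Both results hold the same values; only the shape changes from (n_samples, 1) to (n_samples,), which is what the warning asks for.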

Hope this helps you!


1 Comment

hpaulj Over a year ago

df.label.to_numpy() should produce the desired 1d array.
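A short sketch of the approach from this comment, again with a hypothetical four-row DataFrame. Since the label column in the question comes from a string split, an astype(int) step is added here as an assumption so that the labels end up as integers:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's df; labels are strings after str.split().
df = pd.DataFrame({"label": ["0", "1", "1", "0"]})

# Taking the column directly already gives a 1-D array, but the elements are still strings.
raw = df["label"].to_numpy()
print(raw.shape)  # (4,)

# Converting first yields a 1-D integer array.
y = df["label"].astype(int).to_numpy()
print(y.shape)    # (4,)

# Same values as the longer DataFrame/reshape expression from the answer.
long_form = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1)
print(np.array_equal(y, long_form))  # True
```

This avoids building an intermediate DataFrame and reshaping it back down to one dimension.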
