For a machine learning project I built a pandas DataFrame to use as input to scikit-learn:

  label                                             vector
0      0   1:0.02776011 2:-0.009072121 3:0.05915284 4:-0...
1      1   1:0.014463682 2:-0.00076486735 3:0.044999316 ...
2      1   1:0.010583069 2:-0.0072133583 3:0.03766079 4:...
3      0   1:0.02776011 2:-0.009072121 3:0.05915284 4:-0...
4      1   1:0.039645035 2:-0.039485127 3:0.0898234 4:-0...
..   ...                                                ...
95     0   1:-0.013014212 2:-0.008092734 3:0.050860845 4...
96     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
97     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
98     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
99     0   1:-0.038887568 2:-0.007960074 3:0.03387617 4:...

Here label is the class label of each dataset record and vector is that record's feature vector.

To pass the data frame to scikit-learn I'm creating two separate arrays: one from the label column (y) and one from the vector column (X).

As suggested here, to create the X array I'm doing:

X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)

Everything works, and the output is:

               1               2            3  ...            298            299           300
0     0.02776011    -0.009072121   0.05915284  ...  0.00035095372    -0.01569933  -0.010564591
1    0.014463682  -0.00076486735  0.044999316  ...   -0.008144852  -0.0066369134  -0.013060478
2    0.010583069   -0.0072133583   0.03766079  ...   0.0041615684    0.008569179  -0.008645372
3     0.02776011    -0.009072121   0.05915284  ...  0.00035095372    -0.01569933  -0.010564591
4    0.039645035    -0.039485127    0.0898234  ...   0.0046293125     0.01663368   0.010215017
..           ...             ...          ...  ...            ...            ...           ...
95  -0.013014212    -0.008092734  0.050860845  ...   0.0021799654   -0.011884902   0.016460473
96  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094
97  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094
98  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094
99  -0.038887568    -0.007960074   0.03387617  ...   0.0057248613    0.026993237   0.025746094

[100 rows x 300 columns]

The 100 rows are my records and the 300 columns are the features of each vector.
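
A minimal sketch of the same parsing on a tiny made-up frame (the two sample rows and the names df_demo, parsed and X_demo are invented for illustration, not taken from the real file). One thing it makes visible: astype(float) returns a new object rather than converting in place, so the result has to be assigned if the numeric array is what should reach scikit-learn.

```python
import pandas as pd

# Hypothetical two-row sample in the same "index:value" format
df_demo = pd.DataFrame({"vector": ["1:0.5 2:-0.25 3:1.0", "1:0.0 2:2.0 3:-1.5"]})

# One dict per row: feature index -> value (still strings at this point)
parsed = pd.DataFrame(
    [dict(pair.split(":") for pair in row.split()) for row in df_demo["vector"]]
)

# astype does not modify in place; keep the converted result
X_demo = parsed.astype(float).to_numpy()
print(X_demo.shape)
print(X_demo.dtype)
```

Here parsed still holds strings, while X_demo is a float array with one row per record and one column per feature.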

To create the y array, as suggested here, I'm doing this instead:

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(y)

The output is:

[100 rows x 2 columns]
[[0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [...]
]

I get a NumPy array with the 100 records, but the output reports 2 columns instead of 1.

I think this issue is the cause of the following warning. Right?

/Users/mac-pro/scikit_learn/lib/python3.7/site-packages/sklearn/utils/validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

If so, how can I get output like the one I got for the X array?

If it helps, here is the full code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from scipy.stats import sem


r_filenameTSV = 'TSV/A19784_test3886.tsv'

tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])

df = pd.DataFrame(tsv_read)

df = pd.DataFrame(df.vector.str.split(' ', n=1).tolist(),
                  columns=['label', 'vector'])

print(df)


y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(y)
#exit()

X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)
#exit()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)


clf = svm.SVC(kernel='rbf',
              C=100,
              gamma=0.001,
              )
scores = cross_val_score(clf, X, y, cv=10)

print ("K-Folds scores:")
print (scores) 

#Train the model using the training sets
clf.fit (X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

print ("Metrics and Scoring:")
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("F1:",metrics.f1_score(y_test, y_pred))

print ("Classification Report:")
print (metrics.classification_report(y_test, y_pred,labels=[0,1]))

Thanks again for your time.

1 Answer

As the warning says, you just need to change the shape of your y array.

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

You have two options; either of the following lines will solve it.

Option 1:

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1,1).ravel()
print(y.shape)
# Output (from an 8-row sample; the 100-record data gives (100,))
(8,)

Option 2:

y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1,)
print(y.shape)
# Output (from an 8-row sample; the 100-record data gives (100,))
(8,)
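
To check that both options give the same result, here is a small sketch on a made-up column vector (the values are invented; col stands in for the (n_samples, 1) array from the question):

```python
import numpy as np

# Hypothetical column vector with the same layout as y in the question
col = np.array([[0], [1], [1], [0]])

flat_a = col.ravel()       # Option 1
flat_b = col.reshape(-1)   # Option 2, equivalent to reshape(-1,)

print(col.shape, flat_a.shape, flat_b.shape)
print(np.array_equal(flat_a, flat_b))
```

Both calls flatten the column into a 1d array with one entry per sample, which is the shape the warning asks for.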

Hope this helps you!

1 Comment

df.label.to_numpy() should produce the desired 1d array.
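
A sketch of what the comment suggests, on a made-up frame (df_demo and y_demo are hypothetical names). One assumption worth flagging: in the question the labels come out of str.split, so they are strings, and an astype(int) is probably still wanted before passing them to scikit-learn.

```python
import pandas as pd

# Hypothetical string labels, as produced by splitting a text column
df_demo = pd.DataFrame({"label": ["0", "1", "1", "0"]})

# A Series converts directly to a 1d array; no DataFrame wrapper or reshape needed
y_demo = df_demo["label"].astype(int).to_numpy()
print(y_demo.shape)
```

Because a single column is already one-dimensional, this avoids building the column vector in the first place.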
