For a machine learning project, I built a pandas DataFrame to use as input to scikit-learn:
label vector
0 0 1:0.02776011 2:-0.009072121 3:0.05915284 4:-0...
1 1 1:0.014463682 2:-0.00076486735 3:0.044999316 ...
2 1 1:0.010583069 2:-0.0072133583 3:0.03766079 4:...
3 0 1:0.02776011 2:-0.009072121 3:0.05915284 4:-0...
4 1 1:0.039645035 2:-0.039485127 3:0.0898234 4:-0...
.. ... ...
95 0 1:-0.013014212 2:-0.008092734 3:0.050860845 4...
96 0 1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
97 0 1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
98 0 1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
99 0 1:-0.038887568 2:-0.007960074 3:0.03387617 4:...
Here, label is the class label of each record and vector holds the feature vector of that record.
To pass the data to scikit-learn, I'm creating two separate arrays: one from the label column (y) and one from the vector column (X).
As suggested here, I create the X array like this:
X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)
This works, and the output is:
1 2 3 ... 298 299 300
0 0.02776011 -0.009072121 0.05915284 ... 0.00035095372 -0.01569933 -0.010564591
1 0.014463682 -0.00076486735 0.044999316 ... -0.008144852 -0.0066369134 -0.013060478
2 0.010583069 -0.0072133583 0.03766079 ... 0.0041615684 0.008569179 -0.008645372
3 0.02776011 -0.009072121 0.05915284 ... 0.00035095372 -0.01569933 -0.010564591
4 0.039645035 -0.039485127 0.0898234 ... 0.0046293125 0.01663368 0.010215017
.. ... ... ... ... ... ... ...
95 -0.013014212 -0.008092734 0.050860845 ... 0.0021799654 -0.011884902 0.016460473
96 -0.038887568 -0.007960074 0.03387617 ... 0.0057248613 0.026993237 0.025746094
97 -0.038887568 -0.007960074 0.03387617 ... 0.0057248613 0.026993237 0.025746094
98 -0.038887568 -0.007960074 0.03387617 ... 0.0057248613 0.026993237 0.025746094
99 -0.038887568 -0.007960074 0.03387617 ... 0.0057248613 0.026993237 0.025746094
[100 rows x 300 columns]
The 100 rows are my records and the 300 columns are the features of each vector.
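To make the parsing step easier to reproduce, here is a minimal sketch on two made-up records in the same "index:value" layout. The names toy and X_toy and the values themselves are only for illustration and are not part of my real data.

```python
import pandas as pd

# Two made-up records in the same "index:value" layout as the vector column.
toy = pd.DataFrame({"vector": ["1:0.5 2:-0.25 3:0.125", "1:1.0 2:2.0 3:-3.0"]})

# Each string becomes a dict such as {"1": "0.5", "2": "-0.25", ...};
# pandas then aligns the dict keys into columns.
X_toy = pd.DataFrame(
    [dict(pair.split(":") for pair in row.split()) for row in toy["vector"]]
)

# The values are still strings here, so convert them to floats
# and keep the result instead of only printing it.
X_toy = X_toy.astype(float).to_numpy()

print(X_toy.shape)
print(X_toy)
```

Note that in my own code the converted array is only printed, so X itself is still the DataFrame of strings.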
To create the y array, as suggested here, I'm doing this instead:
y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(y)
The output is:
[100 rows x 2 columns]
[[0]
[1]
[1]
[0]
[1]
[1]
[1]
[1]
[0]
[0]
[0]
[0]
[0]
[0]
[1]
[1]
[...]
]
I get a NumPy array with the 100 records, but the output reports 2 columns instead of 1.
I think this is the cause of the following warning. Is that right?
/Users/mac-pro/scikit_learn/lib/python3.7/site-packages/sklearn/utils/validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
If so, how can I get output in the same form as the one I got for the X array?
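For reference, this is how I understand the two shapes the warning talks about, sketched on four made-up labels. The names labels, y_col and y_flat are only for illustration, and I'm assuming that ravel() is the fix the message is pointing at.

```python
import numpy as np

# Four made-up labels standing in for my label column.
labels = np.array([0, 1, 1, 0])

# What my code produces: a column vector with one label per row.
y_col = labels.reshape(-1, 1)

# What the warning asks for: a flat 1-D array of length n_samples.
y_flat = y_col.ravel()

print(y_col.shape)   # 2-D: (n_samples, 1)
print(y_flat.shape)  # 1-D: (n_samples,)
```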
In case it helps, here is the full code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
from sklearn.model_selection._validation import cross_val_score
from sklearn.model_selection import KFold
from scipy.stats import sem
r_filenameTSV = 'TSV/A19784_test3886.tsv'
tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame(tsv_read)
df = pd.DataFrame(df.vector.str.split(' ', 1).tolist(),
                  columns=['label', 'vector'])
print(df)
y = pd.DataFrame([df.label]).astype(int).to_numpy().reshape(-1, 1)
print(y)
#exit()
X = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(X.astype(float).to_numpy())
print(X)
#exit()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)
clf = svm.SVC(kernel='rbf',
              C=100,
              gamma=0.001,
              )
scores = cross_val_score(clf, X, y, cv=10)
print ("K-Folds scores:")
print (scores)
#Train the model using the training sets
clf.fit (X_train, y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
print ("Metrics and Scoring:")
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("F1:",metrics.f1_score(y_test, y_pred))
print ("Classification Report:")
print (metrics.classification_report(y_test, y_pred,labels=[0,1]))
Thanks again for your time.