Numpy array conversion error

Question

I have a dataset with both string and float data. NumPy tries to convert everything to float, which raises the error "could not convert string to float".

import numpy as np
import scipy
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

pd.set_option('display.height', 750)
pd.set_option('display.width', 750)

colnames = ['AGE', 'WORKCLASS', 'FNLWGT', 'EDU', 'EDU-NUM', 'MARITAL-STATUS',
            'JOB', 'RELATIONSHIP', 'RACE', 'SEX', 'CAPITAL-GAIN', 'CAPITAL-LOSS',
            'HOURS-PER-WEEK', 'NATIVE-COUNTRY', 'INCOME']
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
adults = pd.read_csv(url, names=colnames, header=None)

adults['CAPITAL-GAINS'] = (adults['CAPITAL-GAIN'] - adults['CAPITAL-LOSS'])

adults = adults.drop(['RELATIONSHIP', 'FNLWGT', 'EDU-NUM', 'MARITAL-STATUS', 
'CAPITAL-GAIN', 'CAPITAL-LOSS'], axis=1)
#rearrange the columns to make it easier to set X
adults = adults[['AGE', 'WORKCLASS', 'EDU', 'JOB', 'RACE', 'SEX', 'HOURS-PER-WEEK',
                 'NATIVE-COUNTRY', 'CAPITAL-GAINS', 'INCOME']]
adults.replace({'?': 0}, inplace=True)
#assign the X and y arrays using numpy
X = np.array(adults.iloc[:, 0:9])
y = np.array(adults['INCOME'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
knn = KNeighborsClassifier()
knn.fit(X_train ,y_train)
pred = knn.predict(X_test)
print (accuracy_score(y_test, pred))

Traceback:

Traceback (most recent call last):
  File "C:/Users/nolan/OneDrive/Desktop/digits.py", line 37, in <module>
    knn.fit(X_train ,y_train)
  File "C:\Program Files\Python\lib\site-packages\sklearn\neighbors\base.py", line 765, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "C:\Program Files\Python\lib\site-packages\sklearn\utils\validation.py", line 573, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\Program Files\Python\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: ' Peru'

All the data looks like this:

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0

Is there a way to make NumPy hold this data without the conversion error?

Can you share all the code required to get this error locally?
– Azsgy, commented Apr 5, 2018 at 21:37
What kind of input does knn.fit expect? Can it work with strings, or just numeric values?
– hpaulj, commented Apr 5, 2018 at 22:23

desertnaut · Accepted Answer · 2018-04-05 23:58:17Z

2

There is no NumPy conversion error here; the issue is simply that the k-NN algorithm cannot handle categorical features. Admittedly, this is not explicitly mentioned in the scikit-learn documentation, but it follows directly from what the algorithm does: it computes distances between the data points so that it can then find the k nearest ones, hence the name. Since there is no simple and general way to compute distances between categorical features, the algorithm is simply not applicable in such cases.

See also this answer at Data Science Stack Exchange.
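To make the point above concrete, here is a minimal sketch (not part of the original answer). The two rows are made up to resemble the question's data. It shows that NumPy can hold mixed data as an object array, and that the error only appears when a float array is requested, which is presumably what the np.array(array, dtype=dtype, ...) call in the traceback does, judging by the error message.

```python
import numpy as np

# Two made-up rows shaped like the question's data: numbers mixed with strings.
rows = [[39, " State-gov", 40], [35, " Private", 45]]

# NumPy stores the mixed data without complaint, as generic Python objects.
X = np.array(rows, dtype=object)
print(X.dtype)

# The failure only appears once a float array is requested.
try:
    np.array(rows, dtype=float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# A Euclidean distance is well defined for the numeric columns only.
numeric = X[:, [0, 2]].astype(float)
distance = np.linalg.norm(numeric[0] - numeric[1])
print(distance)
```

So the array can be built; what fails is treating a string such as ' State-gov' as a coordinate in a distance computation.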

edited Apr 5, 2018 at 23:58

answered Apr 5, 2018 at 23:46


blacker · Accepted Answer · 2018-04-06 00:07:14Z

0

You should change the classifier, if possible. SVM and neural networks support this type of data, but KNN does not.
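Whichever classifier is chosen, it may still need numeric input. The following is only a sketch of one common workaround, one-hot encoding the string columns with pandas.get_dummies, and it is not something either answer proposes. The toy rows are hypothetical stand-ins for the question's dataset. Note that one-hot encoding imposes a particular notion of distance between categories, which may or may not be meaningful for a given problem.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy rows imitating a few columns of the question's data.
adults = pd.DataFrame({
    "AGE": [39, 50, 38, 53, 28, 37],
    "WORKCLASS": ["State-gov", "Self-emp", "Private",
                  "Private", "Private", "State-gov"],
    "HOURS-PER-WEEK": [40, 13, 40, 40, 40, 45],
    "INCOME": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# Replace the string column with one 0/1 indicator column per category,
# then make every feature a float so distances can be computed.
features = adults.drop(columns="INCOME")
X = pd.get_dummies(features, columns=["WORKCLASS"]).astype(float)
y = adults["INCOME"]

# With purely numeric features, fitting no longer raises the ValueError.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
pred = knn.predict(X)
print(pred)
```

In practice the numeric columns would probably also need scaling, since AGE and HOURS-PER-WEEK would otherwise dominate the 0/1 indicator columns in the distance.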

answered Apr 6, 2018 at 0:07
