0

I have a dataset with string and float data. NumPy tries to convert everything to a float, giving the error "could not convert string to float".

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

pd.set_option('display.width', 750)

colnames = ['AGE', 'WORKCLASS', 'FNLWGT', 'EDU', 'EDU-NUM', 'MARITAL-STATUS',
            'JOB', 'RELATIONSHIP', 'RACE', 'SEX', 'CAPITAL-GAIN', 'CAPITAL-LOSS',
            'HOURS-PER-WEEK', 'NATIVE-COUNTRY', 'INCOME']
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
adults = pd.read_csv(url, names=colnames, header=None)

# combine the two capital columns into a single net figure
adults['CAPITAL-GAINS'] = adults['CAPITAL-GAIN'] - adults['CAPITAL-LOSS']

adults = adults.drop(['RELATIONSHIP', 'FNLWGT', 'EDU-NUM', 'MARITAL-STATUS',
                      'CAPITAL-GAIN', 'CAPITAL-LOSS'], axis=1)
# rearrange the columns to make it easier to set X
adults = adults[['AGE', 'WORKCLASS', 'EDU', 'JOB', 'RACE', 'SEX',
                 'HOURS-PER-WEEK', 'NATIVE-COUNTRY', 'CAPITAL-GAINS', 'INCOME']]
adults.replace({'?': 0}, inplace=True)
# assign the X and y arrays using numpy
X = np.array(adults.iloc[:, 0:9])
y = np.array(adults['INCOME'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)  # this call raises the ValueError shown below
pred = knn.predict(X_test)
print(accuracy_score(y_test, pred))

Traceback:

Traceback (most recent call last):
  File "C:/Users/nolan/OneDrive/Desktop/digits.py", line 37, in <module>
    knn.fit(X_train ,y_train)
  File "C:\Program Files\Python\lib\site-packages\sklearn\neighbors\base.py", line 765, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "C:\Program Files\Python\lib\site-packages\sklearn\utils\validation.py", line 573, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\Program Files\Python\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: ' Peru'

All the data looks like this:

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0

Is there a way to make NumPy hold this data without the conversion error?

  • Can you share all the code required to get this error locally? Commented Apr 5, 2018 at 21:37
  • Share your traceback and a subsection of your data as well. Commented Apr 5, 2018 at 21:45
  • What kind of input does knn.fit expect? Can it work with strings, or just numeric values? Commented Apr 5, 2018 at 22:23

2 Answers

2

There is no NumPy conversion error here; the issue is simply that the k-NN algorithm cannot handle categorical features. It is true that this is not explicitly mentioned in the scikit-learn documentation, but it follows directly from what the algorithm does: it computes distances between the data points so that it can subsequently find the k nearest ones, hence the name. Since there is no simple and general way to compute distances between categorical features, the algorithm is not applicable in such cases.
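To make the point about distances concrete, here is a minimal sketch. It first reproduces the float conversion failure on a string column, then shows one common workaround that is not part of the answer above: one-hot encoding the categorical columns with pd.get_dummies so that every feature is numeric. The toy rows are made up for illustration and only borrow the column names from the question; n_neighbors=3 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Illustrative toy rows only (not taken from the real adult dataset).
toy = pd.DataFrame({
    "AGE": [39, 50, 38, 53, 28, 37],
    "SEX": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "NATIVE-COUNTRY": ["United-States", "United-States", "United-States",
                       "United-States", "Cuba", "Peru"],
    "INCOME": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# 1) Forcing a string column into a float array fails, which is what
#    scikit-learn's input validation does inside knn.fit.
error_message = ""
try:
    np.array(toy[["AGE", "SEX"]].to_numpy(), dtype=float)
except ValueError as exc:
    error_message = str(exc)

# 2) Assumed workaround: one-hot encode the categorical columns so that
#    every feature is numeric and a distance can be computed.
X = pd.get_dummies(toy.drop(columns="INCOME"),
                   columns=["SEX", "NATIVE-COUNTRY"], dtype=float)
y = toy["INCOME"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
pred = knn.predict(X)
```

Note that with this encoding AGE is on a much larger scale than the 0/1 indicator columns, so it will dominate the distances unless the numeric columns are rescaled first.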

See also this answer at Data Science Stack Exchange.

0

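You could change the classifier, if possible. Models such as SVMs and neural networks are often used with this type of data, which KNN does not support, but note that scikit-learn's implementations of these also expect numeric input, so the categorical columns would still need to be encoded first.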
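As a rough sketch of what swapping the classifier might look like in scikit-learn: since its estimators, SVC included, validate input as numeric arrays, the example below wraps the model in a pipeline that one-hot encodes the string columns and scales the numeric ones. The toy rows, the choice of columns, and the default SVC settings are all assumptions made for illustration, not something taken from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Illustrative toy rows only (not taken from the real adult dataset).
toy = pd.DataFrame({
    "AGE": [39, 50, 38, 53, 28, 37],
    "HOURS-PER-WEEK": [40, 13, 40, 40, 45, 50],
    "WORKCLASS": ["State-gov", "Private", "Private",
                  "Private", "Self-emp", "Self-emp"],
    "SEX": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "INCOME": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

numeric_cols = ["AGE", "HOURS-PER-WEEK"]
categorical_cols = ["WORKCLASS", "SEX"]

# Scale numeric columns and one-hot encode string columns in one step.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# The final estimator can be swapped (e.g. for KNeighborsClassifier)
# without touching the preprocessing.
model = make_pipeline(preprocess, SVC())

features = toy[numeric_cols + categorical_cols]
model.fit(features, toy["INCOME"])
pred = model.predict(features)
```

Keeping the encoding inside the pipeline means the same transformation is applied at fit and predict time, and handle_unknown="ignore" avoids an error if a category appears at prediction time that was not seen during training.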
