I am trying to solve a decision tree problem in Python using scikit-learn and pandas. The data set is in a CSV file. When I try to load the data in Python, I get an error that says "ValueError: could not convert string to float: 'CustomerID'". I don't know what I have done wrong in my code.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
col_names=['CustomerID','Gender','Car Type', 'Shirt Size','Class']
pima=pd.read_csv(r"F:\Current semster courses\Machine Learning\ML_A1_Fall2019\Q2_dataset.csv",header=None, names=col_names)
pima.head()
feature_cols=['CustomerID','Gender','Car Type', 'Shirt Size']
X=pima[feature_cols]
y=pima.Class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = DecisionTreeClassifier()

# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Can someone tell me what I am doing wrong?

Dataset:

CustomerID  Gender  Car Type    Shirt Size  Class
1            M      Family       Small      C0
2            M      Sports       Medium     C0
3            M      Sports       Medium     C0
4            M      Sports       Large      C0
5            M      Sports     Extra Large  C0
6            M      Sports     Extra Large  C0
7            F      Sports       Small      C0
8            F      Sports       Small      C0
9            F      Sports       Medium     C0
10           F      Luxury       Large      C0
11           M      Family       Large      C1
12           M      Family     Extra Large  C1
13           M      Family       Medium     C1
14           M      Luxury    Extra Large   C1
15           F      Luxury       Small      C1
16           F      Luxury       Small      C1
17           F      Luxury       Medium     C1
18           F      Luxury       Medium     C1
19           F      Luxury       Medium     C1
20           F      Luxury       Large      C1
  • Can you provide a few lines of the CSV, or even upload the full file somewhere - so we can recreate the issue. Commented Oct 19, 2019 at 14:46
  • I have added the screenshot of my data Commented Oct 19, 2019 at 15:13
  • Do you mind pasting it in as text as well, so I can copy-paste it? Commented Oct 19, 2019 at 15:25
  • I have added the dataset. Commented Oct 19, 2019 at 15:28
  • Why can't you just do pd.read_csv('file.csv')? It reads fine for me. Commented Oct 19, 2019 at 16:14

1 Answer

Ah. OK. The issue is that your data is categorical, which scikit-learn can't work with directly. It first needs to be converted to numeric data. The function pd.get_dummies() does this by taking a single column with multiple categorical values and converting it to multiple columns, one per category, each containing a 1 or 0 (True or False in recent pandas versions) indicating whether the row belongs to that category.
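To make the conversion concrete, here is a minimal sketch on a made-up four-row frame. The frame `df` and the name `encoded` are only for illustration; the values mirror the "Car Type" column from the question.

```python
import pandas as pd

# Tiny illustrative frame; the values mirror the question's "Car Type" column.
df = pd.DataFrame({"Car Type": ["Family", "Sports", "Luxury", "Sports"]})

# One indicator column per distinct category, named "<column>_<category>".
encoded = pd.get_dummies(df, columns=["Car Type"])

print(list(encoded.columns))
# Cast to int so the indicators print as 1/0 regardless of pandas version.
print(encoded.astype(int))
```

Each row ends up with exactly one indicator set, which is the numeric form the decision tree can split on.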

As an aside, you should remove the CustomerID column from the features. It is an arbitrary identifier that has no bearing on which class a row belongs to.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

col_names=['CustomerID','Gender','Car Type', 'Shirt Size','Class']
data = [['1',  'M', 'Family', 'Small',      'C0'], 
        ['2',  'M', 'Sports', 'Medium',     'C0'], 
        ['3',  'M', 'Sports', 'Medium',     'C0'], 
        ['4',  'M', 'Sports', 'Large',      'C0'], 
        ['5',  'M', 'Sports', 'Extra Large','C0'], 
        ['6',  'M', 'Sports', 'Extra Large','C0'], 
        ['7',  'F', 'Sports', 'Small',      'C0'], 
        ['8',  'F', 'Sports', 'Small',      'C0'], 
        ['9',  'F', 'Sports', 'Medium',     'C0'], 
        ['10', 'F', 'Luxury', 'Large',      'C0'], 
        ['11', 'M', 'Family', 'Large',      'C1'], 
        ['12', 'M', 'Family', 'Extra Large','C1'], 
        ['13', 'M', 'Family', 'Medium',     'C1'], 
        ['14', 'M', 'Luxury', 'Extra Large','C1'], 
        ['15', 'F', 'Luxury', 'Small',      'C1']]

#pima=pd.read_csv("F:\Current semster courses\Machine ...
pima=pd.DataFrame(data, columns = col_names)
# Convert the categorical feature columns to multiple columns of numerical data for the decision tree.
# CustomerID is left out, and Class stays as string labels, which scikit-learn accepts for the target.
pima = pd.get_dummies(pima, columns=['Gender','Car Type', 'Shirt Size'])
print(pima)

#feature_cols=['CustomerID','Gender','Car Type','Shirt Size']
# Only the encoded feature columns go here; including the Class columns would leak the label into the features.
feature_cols=['Gender_F', 'Gender_M',
       'Car Type_Family', 'Car Type_Luxury', 'Car Type_Sports',
       'Shirt Size_Extra Large', 'Shirt Size_Large', 'Shirt Size_Medium',
       'Shirt Size_Small']
X=pima[feature_cols]
y=pima['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print("X_train =", X_train) 
print("X_test =", X_test) 
print("y_train =", y_train)
print("y_test =", y_test )
clf = DecisionTreeClassifier()

# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
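A side note on why the error message names 'CustomerID' specifically: passing header=None together with names= tells pandas the file has no header line, so the header line is read as the first data row. The sketch below shows this on a hypothetical two-row stand-in for Q2_dataset.csv; I am assuming the real file is comma-separated and starts with that header line.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for Q2_dataset.csv, including its header line.
csv_text = (
    "CustomerID,Gender,Car Type,Shirt Size,Class\n"
    "1,M,Family,Small,C0\n"
    "2,M,Sports,Medium,C0\n"
)
col_names = ["CustomerID", "Gender", "Car Type", "Shirt Size", "Class"]

# header=None: pandas assumes there is no header, so the header line becomes row 0.
as_data = pd.read_csv(io.StringIO(csv_text), header=None, names=col_names)

# header=0: the file's first line is treated as a header and replaced by col_names.
as_header = pd.read_csv(io.StringIO(csv_text), header=0, names=col_names)

print(as_data.iloc[0].tolist())
print(len(as_data), len(as_header))
```

If that is what happened here, the string 'CustomerID' sits inside the CustomerID column as a value, which would explain the exact wording of the ValueError. The remaining columns still need encoding either way.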

1 Comment

I don't want my data to be treated as float or integer, I want it to be treated as strings.
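One possible way to keep the DataFrame itself as strings is to move the encoding inside a scikit-learn pipeline, so that fit and predict are called with the raw string columns. This is only a sketch under assumptions: it uses six rows copied from the question's dataset, and the names `encoder` and `model` are mine. The tree still sees numbers internally, but the class labels and the input frame stay as strings.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# A few rows copied from the question's dataset, kept as plain strings.
pima = pd.DataFrame(
    [["M", "Family", "Small", "C0"],
     ["M", "Sports", "Medium", "C0"],
     ["F", "Sports", "Small", "C0"],
     ["M", "Family", "Large", "C1"],
     ["F", "Luxury", "Small", "C1"],
     ["F", "Luxury", "Medium", "C1"]],
    columns=["Gender", "Car Type", "Shirt Size", "Class"],
)
feature_cols = ["Gender", "Car Type", "Shirt Size"]

# The one-hot encoding happens inside the pipeline;
# handle_unknown="ignore" avoids an error on categories not seen during fit.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), feature_cols)]
)
model = make_pipeline(encoder, DecisionTreeClassifier(random_state=1))

# Both the features and the target are passed as strings.
model.fit(pima[feature_cols], pima["Class"])
predictions = model.predict(pima[feature_cols])
print(predictions)
```

The predictions come back as the original string labels, so nothing in your own code has to deal with the numeric columns.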
