5

I'm running the following script:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
dataset = pd.read_csv('data/50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
onehotencoder = OneHotEncoder(categorical_features=3,
                              handle_unknown='ignore')
onehotencoder.fit(X)

The head of the data looks like the sample rows pasted below.

And I've got this:

ValueError: could not convert string to float: 'New York'

I read the answers to similar questions and then opened the scikit-learn documentation, but as far as I can see the scikit-learn authors don't have issues with strings that contain spaces.

I know that I can use LabelEncoder from sklearn.preprocessing and then apply the OneHotEncoder, and that works well, but in that case the following

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
warnings.warn(msg, FutureWarning)

message occurs.
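For reference, this is roughly what that workaround looks like (a sketch, assuming scikit-learn 0.20.x, where categorical_features still exists):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])             # 'California' -> 0, 'Florida' -> 1, 'New York' -> 2
onehotencoder = OneHotEncoder(categorical_features=[3])   # deprecated parameter, triggers the FutureWarning quoted above
X = onehotencoder.fit_transform(X).toarray()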

You can use the full CSV file, or these

[[165349.2, 136897.8, 471784.1, 'New York', 192261.83],
[162597.7, 151377.59, 443898.53, 'California', 191792.06],
[153441.51, 101145.55, 407934.54, 'Florida', 191050.39],
[144372.41, 118671.85, 383199.62, 'New York', 182901.99],
[142107.34, 91391.77, 366168.42, 'Florida', 166187.94]]

first 5 lines, to test this code.
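If it helps, here is a sketch of how those rows can be turned into a test array (dtype=object keeps the strings next to the floats; note these rows still include the profit column):

import numpy as np

X = np.array([[165349.2, 136897.8, 471784.1, 'New York', 192261.83],
              [162597.7, 151377.59, 443898.53, 'California', 191792.06],
              [153441.51, 101145.55, 407934.54, 'Florida', 191050.39],
              [144372.41, 118671.85, 383199.62, 'New York', 182901.99],
              [142107.34, 91391.77, 366168.42, 'Florida', 166187.94]],
             dtype=object)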

5 Comments
  • My input, as you can see from the code, is a CSV file. Commented Nov 26, 2018 at 0:14
  • try: dataset.info() to check the types of data that you have in your dataframe. Commented Nov 26, 2018 at 0:20
  • I've added the first 5 lines and a link to a pastebin with the full content of the file. Commented Nov 26, 2018 at 0:29
  • The 'State' column is full of 50 non-null objects. Now I see the problem, but I still have no idea how to fix it without using LabelEncoder. Commented Nov 26, 2018 at 0:31
  • What would you expect 'New York' to be as a floating point number? Why would you think it has anything to do with a space in the string? Commented Nov 26, 2018 at 0:33

2 Answers

4

It is categorical_features=3 that hurts you. You cannot use categorical_features with string data. Remove this option, and luck will be with you. Also, you probably need fit_transform, not just fit.

onehotencoder = OneHotEncoder(handle_unknown='ignore')
transformed = onehotencoder.fit_transform(X[:, [3]]).toarray()
X1 = np.concatenate([X[:, :3], transformed, X[:, 4:]], axis=1)
#array([[165349.2, 136897.8, 471784.1, 0.0, 0.0, 1.0, 192261.83],
#       [162597.7, 151377.59, 443898.53, 1.0, 0.0, 0.0, 191792.06],
#       [153441.51, 101145.55, 407934.54, 0.0, 1.0, 0.0, 191050.39],
#       [144372.41, 118671.85, 383199.62, 0.0, 0.0, 1.0, 182901.99],
#       [142107.34, 91391.77, 366168.42, 0.0, 1.0, 0.0, 166187.94]])
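If you want to know which dummy column is which state, the fitted encoder stores that (a quick check; categories_ is populated after fit_transform):

print(onehotencoder.categories_)
# e.g. [array(['California', 'Florida', 'New York'], dtype=object)]
# so the three inserted columns are California, Florida and New York, in that order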

8 Comments

In that case the whole dataset is transformed to categorical data, not only the 3rd column
You can choose which columns to transform.
I ran this code: onehotencoder = OneHotEncoder(handle_unknown='ignore') onehotencoder.fit(X[:, 3]) and got this error: ValueError: Expected 2D array, got 1D array instead:
Because you pass a 1D array instead of a 2D array. You ought to pass X[:, [3]] or X[:, 3].reshape(-1, 1) (see the shape check after these comments).
You have to combine the transformed columns with the original columns. I am afraid that your understanding of how Python (and Numpy) works is still insufficient for carrying out complex tasks, and strongly suggest that you read a good numpy tutorial.
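A quick shape check makes the 1D vs. 2D difference concrete (a sketch; the row counts assume the full 50-row CSV):

print(X[:, 3].shape)                 # (50,)   -- 1D, OneHotEncoder rejects this
print(X[:, [3]].shape)               # (50, 1) -- 2D with a single column, what the encoder expects
print(X[:, 3].reshape(-1, 1).shape)  # (50, 1) -- equivalent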
0

Try this:

from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

columntransformer = make_column_transformer(
    (OneHotEncoder(categories='auto'), [3]),
    remainder='passthrough')


X = columntransformer.fit_transform(X)
X = X.astype(float)
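For what it's worth, the one-hot columns come first in the transformed output and the passthrough columns follow. A quick way to see that (a sketch; X is assumed to be the 4-column dataset.iloc[:, :-1].values array from the question, and the result may be a SciPy sparse matrix depending on sparse_threshold):

import numpy as np

# densify the first row if the transformer returned a sparse matrix
first_row = np.asarray(X.todense())[0] if hasattr(X, 'todense') else X[0]
print(first_row)
# roughly: [0.0 0.0 1.0 165349.2 136897.8 471784.1]
# California, Florida, New York dummies first, then the untouched numeric columns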

