
I'm using a Logistic Regression model on my data. From what I understand (e.g. from here: Pandas vs. Numpy Dataframes), it's better to pass sklearn a numpy.ndarray than a pandas DataFrame, which can be done via the DataFrame's .values attribute. I have done this, but I get the error ValueError: Specifying the columns using strings is only supported for pandas DataFrames. Clearly I'm doing something wrong in my code. Any insights are much appreciated.

Funnily enough, my code works when I don't use .values, and just use X as a DataFrame and y as a Pandas Series.

# Imports implied by the code below (loading of `data` is omitted)
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

# We will train our classifier with the following features:
# Numeric features to be scaled: LIMIT_BAL, AGE, PAY_X, BILL_AMTX, and PAY_AMTX
# Categorical features: SEX, EDUCATION, MARRIAGE

# We create the preprocessing pipelines for both numeric and categorical data
numeric_features = ['LIMIT_BAL', 'AGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
                    'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
                    'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

# Cast the payment-status and age columns to float64
for col in ['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'AGE']:
    data[col] = data[col].astype('float64')


numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

categorical_features = ['SEX', 'EDUCATION', 'MARRIAGE']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(categories='auto'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

y = data['default'].values
X = data.drop('default', axis=1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=10, stratify=y)

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
lr = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', LogisticRegression(solver='liblinear'))])

param_grid_lr = {
    'classifier__C': np.logspace(-5, 8, 15)
}

lr_cv = GridSearchCV(lr, param_grid_lr, cv=10, iid=False)

lr_cv.fit(X_train, y_train)

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Comment: added the code for the preprocessor (Jan 21, 2019 at 17:06)

1 Answer


You are using ColumnTransformer as if you had a dataframe, but you don't have one any more: after .values, X is a plain numpy array. From the ColumnTransformer documentation for the columns argument:

column(s) : string or int, array-like of string or int, slice, boolean mask array or callable

Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above.

If you pass strings for the columns, you need to pass a dataframe. If you want to use a numpy array, the astype conversions may not even be required, and you need to specify the columns by integer position rather than by name.
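For illustration, here is a minimal sketch of the integer-index variant. This uses a tiny made-up array, not the question's dataset (the column positions there would have to match the order after dropping 'default'):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy array: columns 0-1 are numeric, column 2 is categorical
X = np.array([[1.0, 10.0, 0],
              [2.0, 20.0, 1],
              [3.0, 30.0, 0]])

preprocessor = ColumnTransformer(transformers=[
    ('num', MinMaxScaler(), [0, 1]),   # positional indices, not names
    ('cat', OneHotEncoder(), [2]),
])

Xt = preprocessor.fit_transform(X)
print(Xt.shape)  # (3, 4): two scaled numeric columns + two one-hot columns
```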


2 Comments

Thank you, Matthieu. Is there any benefit to using a numpy array over a pandas dataframe? I can't see that there would be; I'm just wondering.
It depends on the algorithm that comes afterwards; a contiguous array might be better for some models.
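As a small illustration of that last point (the frame here is made up), np.ascontiguousarray guarantees a C-contiguous copy, which some estimators can iterate over more efficiently:

```python
import numpy as np
import pandas as pd

# Hypothetical all-float frame
df = pd.DataFrame({'f1': [1.0, 2.0], 'f2': [3.0, 4.0]})

# Force a row-major (C-contiguous) layout of the underlying values
arr = np.ascontiguousarray(df.values)
print(arr.flags['C_CONTIGUOUS'])  # True
```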
