
I'm using a Logistic Regression model on my data. From what I understand (e.g. from here: Pandas vs. Numpy Dataframes), it's better to pass sklearn a numpy.ndarray than a pandas DataFrame, which can be done via the DataFrame's .values attribute. I have done this, but I get the error ValueError: Specifying the columns using strings is only supported for pandas DataFrames. Clearly I'm doing something wrong in my code. Any insights are much appreciated.

Funnily enough, my code works when I don't use .values, and just use X as a DataFrame and y as a Pandas Series.

# Imports implied by the code below (loading of `data` is omitted)
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

# We will train our classifier with the following features:
# Numeric features to be scaled: LIMIT_BAL, AGE, PAY_X, BILL_AMTX, and PAY_AMTX
# Categorical features: SEX, EDUCATION, MARRIAGE

# We create the preprocessing pipelines for both numeric and categorical data
numeric_features = ['LIMIT_BAL', 'AGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
                    'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
                    'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

# Cast the payment-status and age columns to float64
for col in ['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'AGE']:
    data[col] = data[col].astype('float64')


numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

categorical_features = ['SEX', 'EDUCATION', 'MARRIAGE']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(categories='auto'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

y = data['default'].values
X = data.drop('default', axis=1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=10, stratify=y)

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
lr = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', LogisticRegression(solver='liblinear'))])

param_grid_lr = {
    'classifier__C': np.logspace(-5, 8, 15)
}

lr_cv = GridSearchCV(lr, param_grid_lr, cv=10, iid=False)

lr_cv.fit(X_train, y_train)

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Comment: added the code for the preprocessor (Jan 21, 2019 at 17:06)

1 Answer


You are using ColumnTransformer as if you had a dataframe, but you don't have one any more: after .values, X is a plain numpy array. From the ColumnTransformer documentation for the columns argument:

column(s) : string or int, array-like of string or int, slice, boolean mask array or callable

Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above.

If you pass strings for the columns, you need to pass a dataframe. If you want to use a numpy array, the astype conversions may not even be required, and you need to specify the columns by integer position rather than by name.
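For illustration, here is a minimal sketch of the integer-index variant. This uses a tiny made-up array, not the question's dataset (the column positions there would have to match the order after dropping 'default'):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy array: columns 0-1 are numeric, column 2 is categorical
X = np.array([[1.0, 10.0, 0],
              [2.0, 20.0, 1],
              [3.0, 30.0, 0]])

preprocessor = ColumnTransformer(transformers=[
    ('num', MinMaxScaler(), [0, 1]),   # positional indices, not names
    ('cat', OneHotEncoder(), [2]),
])

Xt = preprocessor.fit_transform(X)
print(Xt.shape)  # (3, 4): two scaled numeric columns + two one-hot columns
```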


2 Comments

Thank you, Matthieu. Is there any benefit to using a numpy array over a pandas dataframe? I can't see that there would be; I'm just wondering.
It depends on the algorithm that comes afterwards; a contiguous array might be better for some models.
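As a small illustration of that last point (the frame here is made up), np.ascontiguousarray guarantees a C-contiguous copy, which some estimators can iterate over more efficiently:

```python
import numpy as np
import pandas as pd

# Hypothetical all-float frame
df = pd.DataFrame({'f1': [1.0, 2.0], 'f2': [3.0, 4.0]})

# Force a row-major (C-contiguous) layout of the underlying values
arr = np.ascontiguousarray(df.values)
print(arr.flags['C_CONTIGUOUS'])  # True
```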
