I'm trying to use SKLearn 0.20.2 to make a pipeline while using the new ColumnTransformer feature. My problem is that I keep getting the error:
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
I have a column of blobs of text called text. All of my other columns are numerical. I'm trying to use CountVectorizer in my pipeline, and I think that's where the trouble is. I would much appreciate a hand with this.
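from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline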
# mapped to column names from dataframe
numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])
# mapped to column names from dataframe
text_features = ['text']
text_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('vect', CountVectorizer())
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('text', text_transformer, text_features)
    ]
)
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', MultinomialNB())
])
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)
clf.fit(x_train,y_train)
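The AttributeError is consistent with CountVectorizer receiving 2-D input instead of a 1-D sequence of strings. The following is a minimal sketch of that behaviour, not a diagnosis of the exact dataframe in the question: the three-row `df` is a hypothetical stand-in for the real text column, and the names `one_d`, `two_d`, `ok`, `error_name` and `wrong_rows` are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the real 'text' column (assumption, not the asker's data).
df = pd.DataFrame({'text': ['fix login bug', 'add export feature', 'fix crash']})

one_d = df['text']      # Series, shape (3,): a 1-D sequence of strings
two_d = df[['text']]    # DataFrame, shape (3, 1): what a list selector like ['text'] yields

# 1-D input: one output row per document.
ok = CountVectorizer().fit_transform(one_d)

# 2-D ndarray (the kind of output SimpleImputer produces): each "document"
# is itself a length-1 ndarray, which has no .lower() method.
try:
    CountVectorizer().fit_transform(two_d.to_numpy())
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# 2-D DataFrame: iterating over it yields column names, so the vectorizer
# sees a single "document" (the string 'text') rather than three rows.
wrong_rows = CountVectorizer().fit_transform(two_d).shape[0]

print(ok.shape[0], error_name, wrong_rows)
```

If this matches what is happening in the pipeline, the row-count mismatch in the DataFrame case would also explain a concatenation error once the imputer is removed, because the numeric block and the text block would no longer have the same number of rows.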
SimpleImputer is not intended for text. Try text_transformer = Pipeline([('vect', CountVectorizer())]) and see what happens.
ValueError: all the input array dimensions except for the concatenation axis must match exactly. I'll update my snippet to remove the imputer.
fit_transform method, which line, preprocessor or clf, produces the error?
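One way to address both errors discussed in this thread is to give ColumnTransformer the text column as a plain string ('text') rather than a list (['text']), so that CountVectorizer receives a 1-D column. The sketch below assumes that approach. The toy `features` and `labels` are hypothetical stand-ins for the real data, and filling missing text with an empty string before fitting is an assumption used here in place of the imputer, since CountVectorizer does not accept missing values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real dataframe (assumption).
features = pd.DataFrame({
    'hasDate': [1, 0, 1, 0, 1, 0],
    'iterationCount': [3, np.nan, 5, 1, 2, 4],
    'hasItemNumber': [0, 1, 1, 0, 0, 1],
    'isEpic': [0, 0, 1, 1, 0, 1],
    'text': ['fix login bug', 'add export feature', None,
             'fix crash on export', 'add login page', 'update docs'],
})
labels = [0, 1, 1, 0, 1, 1]

# Assumption: replace missing text up front instead of imputing in the pipeline.
features['text'] = features['text'].fillna('')

numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']

preprocessor = ColumnTransformer(transformers=[
    # A list selector passes a 2-D block, which is what SimpleImputer expects.
    ('num', SimpleImputer(strategy='median'), numeric_features),
    # A string selector passes a 1-D column, which is what CountVectorizer expects.
    ('text', CountVectorizer(), 'text'),
])

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', MultinomialNB()),
])

clf.fit(features, labels)
predictions = clf.predict(features)
n_rows = clf.named_steps['preprocessor'].transform(features).shape[0]
print(len(predictions), n_rows)
```

The train/test split is left out only to keep the sketch short; the same `clf` can be fitted on `x_train` and `y_train` as in the question.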