
I've got a DataFrame with floats, strings, and strings that can be interpreted as dates.

Label encoding across multiple columns in scikit-learn

from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items():  # iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

num_attributes = ["a", "b", "c"]
num_attributes = list(df_num_median)
str_attributes = list(df_str_only)

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer  # replaces Imputer, which was removed in scikit-learn 0.22
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attributes)), # transforming the Pandas DataFrame into a NumPy array
    ('imputer', SimpleImputer(strategy="median")), # replacing missing values with the median
    ('std_scalar', StandardScaler()), # scaling the features using standardization (subtract mean value, divide by standard deviation)
])

from sklearn.preprocessing import LabelEncoder

str_pipeline = Pipeline([
    ('selector', DataFrameSelector(str_attributes)), # transforming the Pandas DataFrame into a NumPy array 
    ('encoding', MultiColumnLabelEncoder(str_attributes))
])

from sklearn.pipeline import FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    #("str_pipeline", str_pipeline) # replaced by line below
    ("str_pipeline", MultiColumnLabelEncoder(str_attributes))
])

df_prepared = full_pipeline.fit_transform(df_combined)

The num_pipeline part of the pipeline works just fine. In the str_pipeline part I get this error:

IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

This doesn't happen if I comment out the MultiColumnLabelEncoder in the str_pipeline. I also created some code to apply the MultiColumnLabelEncoder on the dataset without the pipeline and it works just fine. Any ideas? As an additional step, I would have to create two separate pipelines for strings and date strings.

EDIT: added DataFrameSelector class


Comments

  • @VivekKumar It seemed solved, but I re-ran the whole thing and I get an error; see my comment on your answer. (Aug 31, 2018 at 12:30)
  • Is your problem solved now? Did you check the data? (Sep 6, 2018 at 10:56)

1 Answer


The problem is not in the MultiColumnLabelEncoder, but in the DataFrameSelector above it in the pipeline.

You are doing this:

str_pipeline = Pipeline([
    ('selector', DataFrameSelector(str_attributes)), # transforming the Pandas DataFrame into a NumPy array 
    ('encoding', MultiColumnLabelEncoder(str_attributes))
])

DataFrameSelector returns the .values attribute of the DataFrame, which is a NumPy array. So when MultiColumnLabelEncoder then does this:

...
...
    if self.columns is not None:
        for col in self.columns:
            output[col] = LabelEncoder().fit_transform(output[col])

the error is thrown by output[col]. Here output is a copy of X, which DataFrameSelector has already converted to a NumPy array, so it no longer carries the column names and cannot be indexed by them.
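To see the failure in isolation, here is a minimal sketch that does not need scikit-learn. The toy DataFrame and its column names are made up for illustration; only the `.values` call mirrors what DataFrameSelector.transform does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the string columns of the question.
df = pd.DataFrame({"city": ["Rome", "Oslo"], "kind": ["a", "b"]})

# This is what DataFrameSelector.transform hands to the next step.
arr = df[["city", "kind"]].values

error = None
try:
    arr["city"]  # a plain NumPy array cannot be indexed by column name
except IndexError as exc:
    error = exc

print(type(arr).__name__)  # ndarray
print(error)
```

Indexing the original DataFrame with `df["city"]` works, which is why the encoder runs fine outside the pipeline.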

Since you are already passing str_attributes to MultiColumnLabelEncoder, you don't need DataFrameSelector in that part of the pipeline. Just do this:

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("str_pipeline", MultiColumnLabelEncoder(str_attributes))
])

I have removed str_pipeline because, once DataFrameSelector is dropped, it would contain only a single transformer.
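A self-contained sketch of this layout on toy data might look like the following. Several things here are assumptions rather than the asker's code: the toy DataFrame, the use of SimpleImputer (the current replacement for Imputer), and the hypothetical ColumnLabelEncoder, which returns only the encoded columns. The MultiColumnLabelEncoder from the question returns a copy of the whole DataFrame, so inside a FeatureUnion the numeric columns would appear a second time, unscaled.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select columns from a DataFrame and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


class ColumnLabelEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical variant: label-encode the listed columns, return only those."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X is still a DataFrame here, so X[col] works.
        encoded = [LabelEncoder().fit_transform(X[col]) for col in self.columns]
        return np.column_stack(encoded)


# Toy data (made up): one numeric column with a gap, one string column.
toy = pd.DataFrame({
    "price": [1.0, np.nan, 3.0, 5.0],
    "color": ["red", "blue", "red", "green"],
})

num_pipeline = Pipeline([
    ("selector", DataFrameSelector(["price"])),
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("str_pipeline", ColumnLabelEncoder(["color"])),  # receives the DataFrame directly
])

prepared = full_pipeline.fit_transform(toy)
print(prepared.shape)
```

Note that this encoder, like the one in the question, refits LabelEncoder on every call to transform, so the integer codes are not guaranteed to match between training and test data.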


3 Comments

I re-executed the whole notebook and now I get TypeError: '<' not supported between instances of 'float' and 'str'
@Alessandro Can you please add the stack trace of the new error and your new code to the question?
@Alessandro Looks like a problem in data. Check if the data you send to MultiColumnLabelEncoder is all strings, or a combination of strings and numbers.
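One way to run the check suggested in the last comment is sketched below. The column contents are made up, and casting to str is only one possible workaround; it is an assumption that this is acceptable for the data at hand. Sorting is what fails on mixed types, and a missing value (NaN) counts as a float.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing strings with a float.
mixed = pd.Series(["b", 1.5, "a"], dtype=object)

try:
    np.unique(mixed.to_numpy())  # sorting must compare a str with a float
    sort_failed = False
except TypeError:
    sort_failed = True

# Report which Python types are actually present in the column.
type_names = sorted({type(v).__name__ for v in mixed})
print(sort_failed, type_names)

# Possible workaround: make the column uniformly str before encoding.
as_text = mixed.astype(str)
codes = np.unique(as_text.to_numpy(), return_inverse=True)[1]
print(codes.tolist())
```

Running the type check over every column in str_attributes should show which ones contain floats alongside strings.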
