I want to build a sklearn Pipeline (itself part of a larger Pipeline) that:

  1. encodes categorical columns (OneHotEncoder)
  2. reduces dimension (SVD)
  3. adds numerical columns (without transformation)
  4. aggregates lines (pandas groupby)

I used this pipeline example:

and this example for a custom TransformerMixin:

I get an error at step 4 (there is no error if I comment out step 4):

AttributeError                            Traceback (most recent call last)
in ()
----> 1 X_train_transformed = pipe.fit_transform(X_train)
....
AttributeError: 'numpy.ndarray' object has no attribute 'fit'

My code:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

# does nothing, but is here to collect numerical columns
class nothing(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):       

        return self

    def transform(self, X):          

        return X


class Aggregator(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        X = X.rename(columns = {0 :'InvoiceNo', 1 : 'amount', 2:'Quantity', 
                                3:'UnitPrice',4:'CustomerID' })
        X['InvoiceNo'] =  X['InvoiceNo'].astype('int')
        X['Quantity'] = X['Quantity'].astype('float64')
        X['UnitPrice'] = X['UnitPrice'].astype('float64')
        aggregations = dict()
        # columns 5 onward hold the SVD components; include the last one
        for col in range(5, X.shape[1]):
            aggregations[col] = 'max'

        aggregations.update({ 'CustomerID' : 'first',
                            'amount' : "sum",'Quantity' : 'mean', 'UnitPrice' : 'mean'})

        # aggregating all basket lines
        result = X.groupby('InvoiceNo').agg(aggregations)

        # add number of lines in the basket
        result['lines_nb'] = X.groupby('InvoiceNo').size()
        return result

numeric_features = ['InvoiceNo', 'amount', 'Quantity', 'UnitPrice',
                    'CustomerID']
numeric_transformer = Pipeline(steps=[('nothing', nothing())])

categorical_features = ['StockCode', 'Country']

preprocessor = ColumnTransformer(
    [
        # 'num' transformer does nothing, but is here to
        # collect numerical columns
        ('num', numeric_transformer, numeric_features),
        ('cat', Pipeline([
            ('onehot', OneHotEncoder(handle_unknown='ignore')),
            ('best', TruncatedSVD(n_components=100)),
        ]), categorical_features),
    ]
)

# Edited following Artem's answer. The original code was:
# aggregator = ('agg', Aggregator())

pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    # Edited following Artem's answer. The original step was:
    # ('aggregator', aggregator),
    ('aggregator', Aggregator()),
])

X_train_transformed = pipe.fit_transform(X_train)
  • Could you please add a reproducible example for your issue using sample data? Commented Jan 25, 2019 at 6:57
  • Did you try to cut the problem down? If you return X in Aggregator.transform(), do you still get an error? If not, then the problem does not come from the pipeline. Commented Jan 25, 2019 at 7:21
  • It looks like an element of your pipeline should be an estimator but is a numpy.ndarray instead. You may want to try running Aggregator.transform() by itself to see if it returns the expected result. Commented Jan 25, 2019 at 7:35
  • What do you refer to as step 4? Also, I can see at least one problem: in your pipe instantiation, aggregator is a tuple, while it should be a class instance, I think, i.e. try ('aggregator', Aggregator()). Commented Jan 25, 2019 at 10:26
  • @AI_Learning Yes, this is good advice. Next time I'll make sure to add a reproducible example. Commented Jan 25, 2019 at 20:52

1 Answer

Pipeline steps have the form ('name', estimator), where the second element is an estimator instance, but the original code essentially had:

aggregator = ('agg', Aggregator())

pipe = Pipeline(steps=[
                      ('preprocessor', preprocessor),
                      ('aggregator', aggregator),
])

which made the last step ('aggregator', ('agg', Aggregator())): a tuple nested inside a tuple, rather than a name paired with a transformer.
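To catch this kind of mistake before fitting, a small sanity check over the step list can help. The sketch below is only an illustration: `find_malformed_steps` and `DummyTransformer` are hypothetical names introduced here, not part of scikit-learn, and the check only tests for a `fit` attribute, which is an assumption about what a usable step looks like.

```python
# Hypothetical helper, not part of scikit-learn: reports steps whose
# second element does not look like an estimator (no `fit` attribute).
def find_malformed_steps(steps):
    """Return the names of steps whose second element lacks `fit`."""
    bad = []
    for name, estimator in steps:
        # Pipeline also accepts None and the string 'passthrough' as steps.
        if estimator is None or estimator == "passthrough":
            continue
        if not hasattr(estimator, "fit"):
            bad.append(name)
    return bad


class DummyTransformer:
    """Stand-in for Aggregator: implements fit and transform."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


# The step list from the question: the second step wraps a tuple in a tuple.
nested = [
    ("preprocessor", DummyTransformer()),
    ("aggregator", ("agg", DummyTransformer())),
]

# The corrected step list: each name is paired directly with an instance.
fixed = [
    ("preprocessor", DummyTransformer()),
    ("aggregator", DummyTransformer()),
]

print(find_malformed_steps(nested))
print(find_malformed_steps(fixed))
```

With the nested list the helper reports the 'aggregator' step, and with the corrected list it reports nothing.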

1 Comment

Thanks, I have edited my code as below, and the pipeline now runs end to end.

pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('aggregator', Aggregator()),
])