Use pipeline with custom transformer in Scikit Learn

Question

I am trying to transform column 'X' using the values in column 'y' (a toy example, just to show y being used in a transformation) before the data is fitted by the final linear regression estimator. Why is df['y'] not passed to MyTransformer?

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

class MyTransformer(TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        print(y)
        return X + np.sum(y)

df = pd.DataFrame(np.array([[2, 3], [1, 5], [1, 1], [5, 6], [1, 2]]), columns=['X', 'y'])
pip = Pipeline([('my_transformer', MyTransformer()),
                ('sqrt', FunctionTransformer(np.sqrt, validate=False)),
                ('lr', LinearRegression())])
pip.fit(df[['X']], df['y'])

Running this script raises an error at the line return X + np.sum(y); it looks like y is None.

DasHund · Accepted Answer · 2019-07-11 00:59:21Z

1

As the other answer notes, the fit_transform method inherited from TransformerMixin doesn't pass y on to transform. What I've done in the past is implement my own fit_transform. It's not your code, but here's an example I wrote recently:

from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    """Label-encode each column of a 2-D NumPy array independently."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        data = X.copy()
        for i in range(data.shape[1]):
            # A fresh LabelEncoder is fitted on each column at every call.
            data[:, i] = LabelEncoder().fit_transform(data[:, i])
        return data
    def fit_transform(self, X, y=None):
        # Hand-written fit_transform, so we control what transform receives.
        return self.fit(X, y).transform(X)

There are other ways. For example, you could keep y on the instance (as a constructor parameter or an attribute) and access it in the transform method.
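A minimal sketch of a variation on this idea, assuming it is acceptable to capture what you need from y during fit instead of handing y to the constructor. The class name AddTargetSum and the attribute y_sum_ are made up for this illustration; the data is the toy frame from the question.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AddTargetSum(TransformerMixin, BaseEstimator):
    """Toy transformer: adds the sum of y seen during fit to every value of X."""

    def fit(self, X, y=None):
        # Keep the statistic derived from y on the instance (hypothetical name).
        self.y_sum_ = 0 if y is None else np.sum(y)
        return self

    def transform(self, X):
        # transform no longer needs y: the value was captured in fit.
        return X + self.y_sum_


df = pd.DataFrame([[2, 3], [1, 5], [1, 1], [5, 6], [1, 2]], columns=["X", "y"])
transformed = AddTargetSum().fit_transform(df[["X"]], df["y"])
```

Because the value lives on the fitted instance, transform applies the same shift when the pipeline later calls it without y, for instance during predict.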

Edit: I should note that you can pass y off to your version of transform. So:

def fit_transform(self, X, y=None):
    return self.fit(X, y).transform(X, y)
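Applied to the question's toy pipeline, the override might look like the sketch below. The guard for y being None is an addition of mine, not part of the original code: Pipeline only goes through fit_transform while fitting, and calls transform(X) without y at predict time.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


class MyTransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # y only arrives via our own fit_transform; at predict time it is None.
        if y is None:
            return X
        return X + np.sum(y)

    def fit_transform(self, X, y=None, **fit_params):
        # Override the mixin's version so that y reaches transform.
        return self.fit(X, y).transform(X, y)


df = pd.DataFrame([[2, 3], [1, 5], [1, 1], [5, 6], [1, 2]], columns=["X", "y"])
pip = Pipeline([
    ("my_transformer", MyTransformer()),
    ("sqrt", FunctionTransformer(np.sqrt, validate=False)),
    ("lr", LinearRegression()),
])
pip.fit(df[["X"]], df["y"])

# Call the step directly to see what the pipeline fed downstream during fit.
shifted = pip.named_steps["my_transformer"].fit_transform(df[["X"]], df["y"])
```

Note the trade-off: with this approach the shift is applied during fit but not during predict, so the final estimator sees differently scaled features at the two stages. That is harmless for a toy example but worth checking in real use.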

edited Jul 11, 2019 at 0:59

answered Jul 11, 2019 at 0:53


2 Comments

Nicholas Over a year ago

So basically just use Python's duck typing to bypass TransformerMixin, and everything should be fine, right?

DasHund Over a year ago

You can also extend TransformerMixin, but don't have to. If you do, you get the fit_transform method. The key is really just to override it if you do extend it. Here's the source for that method in sklearn, btw: github.com/scikit-learn/scikit-learn/blob/…

wangtianye · Accepted Answer · 2019-07-11 00:49:13Z

0

The following statement in TransformerMixin's fit_transform is what gets executed. As you can see, transform is called with X only:

self.fit(X, y, **fit_params).transform(X)
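A small check of this behaviour, as a sketch: the Probe class and its seen_y attribute are hypothetical names used only to record what transform receives when the inherited fit_transform is called.

```python
from sklearn.base import TransformerMixin


class Probe(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Record what arrived so it can be inspected afterwards.
        self.seen_y = y
        return X


probe = Probe()
# y is given to fit_transform, but the mixin forwards it to fit only.
probe.fit_transform([[1], [2]], [0, 1])
```

After this call, probe.seen_y is None even though a y was supplied, which matches the error seen in the question.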

answered Jul 11, 2019 at 0:49
