
I need to create a custom transformer to be input into a grader.

The grader passes a list of dictionaries, not a DataFrame, to the predict or predict_proba method of my estimator, so the model must work with both data types. For this reason, I need to provide a custom ColumnSelectTransformer to use instead of scikit-learn's own ColumnTransformer.

This is my code for the custom transformer, which aims to drop null values in the given columns.

simple_cols = ['BEDCERT', 'RESTOT', 'INHOSP', 'CCRC_FACIL', 'SFF', 'CHOW_LAST_12MOS', 'SPRINKLER_STATUS', 'EXP_TOTAL', 'ADJ_TOTAL']

class ColumnSelectTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        X.dropna(inplace=True)
        return X[self.columns].values()

simple_features = Pipeline([
    ('cst', ColumnSelectTransformer(simple_cols)),
])

However, I am unable to pass the following assertion tests:

assert data['RESTOT'].isnull().sum() > 0
assert not np.isnan(simple_features.fit_transform(data)).any()

I get a TypeError:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-44-922f08231b1f> in <module>()
      1 assert not data['RESTOT'].isnull().sum() > 0
----> 2 assert not np.isnan(simple_features.fit_transform(data)).any()

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    391                 return Xt
    392             if hasattr(last_step, 'fit_transform'):
--> 393                 return last_step.fit_transform(Xt, y, **fit_params)
    394             else:
    395                 return last_step.fit(Xt, y, **fit_params).transform(Xt)

/opt/conda/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    551         if y is None:
    552             # fit method of arity 1 (unsupervised transformation)
--> 553             return self.fit(X, **fit_params).transform(X)
    554         else:
    555             # fit method of arity 2 (supervised transformation)

<ipython-input-42-e20ea4310864> in transform(self, X)
     12             X = pd.DataFrame(X)
     13         X.dropna(inplace=True)
---> 14         return X[self.columns].values()
     15 
     16 simple_features = Pipeline([

TypeError: 'numpy.ndarray' object is not callable

Here is the actual data if anyone wants access.

%%bash
mkdir data
wget http://dataincubator-wqu.s3.amazonaws.com/mldata/providers-train.csv -nc -P ./ml-data
wget http://dataincubator-wqu.s3.amazonaws.com/mldata/providers-metadata.csv -nc -P ./ml-data

data = pd.read_csv('./ml-data/providers-train.csv', encoding='latin1')

1 Answer

As the traceback shows, the error is in X[self.columns].values(). values is an attribute that holds a NumPy array, not a method, so you cannot call it (that is, put parentheses after it). Use X[self.columns].values instead.
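For illustration, here is a minimal sketch of the transformer with that fix applied. Two things in it are assumptions rather than part of the question: the toy rows (including the OTHER column) are made up to stand in for the grader's list of dictionaries, and the columns are selected before dropna so that nulls in unrelated columns do not remove rows and the caller's data is not modified in place.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ColumnSelectTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Accept either a DataFrame or a list of dictionaries.
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        # Assumption: select first, then drop rows with nulls in those columns.
        selected = X[self.columns].dropna()
        # .values is an attribute (a NumPy array), so no parentheses.
        return selected.values


# Hypothetical toy input standing in for the grader's list of dictionaries.
rows = [
    {'BEDCERT': 100, 'RESTOT': 80.0, 'OTHER': 'a'},
    {'BEDCERT': 50, 'RESTOT': None, 'OTHER': 'b'},
    {'BEDCERT': 75, 'RESTOT': 60.0, 'OTHER': None},
]

toy_features = Pipeline([
    ('cst', ColumnSelectTransformer(['BEDCERT', 'RESTOT'])),
])
result = toy_features.fit_transform(rows)
```

Note that dropping rows inside transform changes the number of samples, which may matter if the grader expects one prediction per input row.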


1 Comment

Please notice that OP does not have the necessary reputation for the upvoting privilege (15 pts)
