
The example is fully reproducible. Here is the full notebook (which also downloads the data): https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb

After this part of the notebook above:

full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),
        ("linear", LinearRegression())
    ])

full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(some_data)

I am trying to get predictions on the test set with this code:

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = full_pipeline_with_predictor.predict(X_test_prepared)

But I receive this error:

C:\Users\Alex\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py:430: FutureWarning: Given feature/column names or counts do not match the ones for the data given during fit. This will fail from v0.24.
  FutureWarning)
---------------------------------------------------------------------------
Empty                                     Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
    796             try:
--> 797                 tasks = self._ready_batches.get(block=False)
    798             except queue.Empty:

~\AppData\Local\Continuum\anaconda3\lib\queue.py in get(self, block, timeout)
    166                 if not self._qsize():
--> 167                     raise Empty
    168             elif timeout is None:

Empty: 

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-141-dc87b1c9e658> in <module>
      5 
      6 X_test_prepared = full_pipeline.transform(X_test)
----> 7 final_predictions = full_pipeline_with_predictor.predict(X_test_prepared)
      8 
      9 final_mse = mean_squared_error(y_test, final_predictions)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\pipeline.py in predict(self, X, **predict_params)
    417         Xt = X
    418         for _, name, transform in self._iter(with_final=False):
--> 419             Xt = transform.transform(Xt)
    420         return self.steps[-1][-1].predict(Xt, **predict_params)
    421 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in transform(self, X)
    586 
    587         self._validate_features(X.shape[1], X_feature_names)
--> 588         Xs = self._fit_transform(X, None, _transform_one, fitted=True)
    589         self._validate_output(Xs)
    590 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in _fit_transform(self, X, y, func, fitted)
    455                     message=self._log_message(name, idx, len(transformers)))
    456                 for idx, (name, trans, column, weight) in enumerate(
--> 457                         self._iter(fitted=fitted, replace_strings=True), 1))
    458         except ValueError as e:
    459             if "Expected 2D array, got 1D array instead" in str(e):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
   1002             # remaining jobs.
   1003             self._iterating = False
-> 1004             if self.dispatch_one_batch(iterator):
   1005                 self._iterating = self._original_iterator is not None
   1006 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
    806                 big_batch_size = batch_size * n_jobs
    807 
--> 808                 islice = list(itertools.islice(iterator, big_batch_size))
    809                 if len(islice) == 0:
    810                     return False

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in <genexpr>(.0)
    454                     message_clsname='ColumnTransformer',
    455                     message=self._log_message(name, idx, len(transformers)))
--> 456                 for idx, (name, trans, column, weight) in enumerate(
    457                         self._iter(fitted=fitted, replace_strings=True), 1))
    458         except ValueError as e:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\__init__.py in _safe_indexing(X, indices, axis)
    404     if axis == 1 and indices_dtype == 'str' and not hasattr(X, 'loc'):
    405         raise ValueError(
--> 406             "Specifying the columns using strings is only supported for "
    407             "pandas DataFrames"
    408         )

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Question: How can I correct this error, and why does it happen?


1 Answer


Since your final pipeline:

full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),
        ("linear", LinearRegression())
    ])

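already contains full_pipeline, you should not "prepare" X_test again; doing so "prepares" X_test twice, which is wrong. The error arises because full_pipeline.transform(X_test) returns an array rather than a pandas DataFrame, so the ColumnTransformer inside the final pipeline can no longer select columns by name. Your code should simply be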

final_predictions = full_pipeline_with_predictor.predict(X_test)

exactly as you do to get predictions for some_data, i.e.

full_pipeline_with_predictor.predict(some_data)

where you correctly do not "prepare" some_data before feeding it into the final pipeline.

The whole point of using pipelines is exactly this: to avoid running fit and transform separately for each of possibly several preparation steps, by wrapping them all into a single pipeline. You apply this correctly when you predict on some_data, but seem to have forgotten it in the next step, when you try to predict on X_test.
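To see both behaviours side by side, here is a minimal, self-contained sketch. It does not use the notebook's data: the toy DataFrame, the column names rooms and area, and the variable names are all made up for illustration, and a StandardScaler plus OneHotEncoder stand in for the notebook's full_pipeline. The exact error message may differ between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the housing set.
train = pd.DataFrame({
    "rooms": [3.0, 4.0, 5.0, 6.0],
    "area": ["inland", "coast", "inland", "coast"],
})
train_labels = np.array([100.0, 180.0, 160.0, 240.0])
test = pd.DataFrame({"rooms": [4.5, 5.5], "area": ["coast", "inland"]})

# Columns are selected by name, which requires a DataFrame as input.
preparation = ColumnTransformer([
    ("num", StandardScaler(), ["rooms"]),
    ("cat", OneHotEncoder(), ["area"]),
])
model = Pipeline([
    ("preparation", preparation),
    ("linear", LinearRegression()),
])
model.fit(train, train_labels)

# Correct: pass the raw DataFrame; the pipeline prepares it internally.
predictions = model.predict(test)

# Incorrect: preparing first hands the pipeline an array without column
# names, so its own preparation step cannot process it a second time.
prepared = model.named_steps["preparation"].transform(test)
try:
    model.predict(prepared)
    double_prepare_failed = False
except ValueError as exc:
    double_prepare_failed = True
    print("ValueError:", exc)
```

If you ever need the prepared features on their own, feed them to the final estimator directly, e.g. model.named_steps["linear"].predict(prepared), rather than to the whole pipeline.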
