I'm just getting started in ML and need some help getting sklearn to work with pandas.

http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection-as-part-of-a-pipeline

I was reading this and decided to try it out with a DataFrame I had. Below is what I did and the error it produced. I'm pretty new to all of this, so excuse me if I'm overlooking something dumb, but I thought it would be better to ask here than to hack away at an answer without really understanding it.

Thanks guys!

In [518]: cols = ['A','B','C','D','E','F','G','H','I','J','K']

In [519]: x = df['Miss'].values

In [520]: y = df[list(cols)].values

In [532]: y.shape
Out[532]: (11345, 11)

In [533]: x.shape
Out[533]: (11345,)

clf = Pipeline([
  ('feature_selection', LinearSVC(penalty="l1", dual=False)),
  ('classification', RandomForestClassifier())])

In [536]: clf.fit(x,y)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/cschwalbach/as_research_repo/logs/<ipython-input-536-5c1831092d7a> in <module>()
----> 1 clf.fit(x,y)

/usr/lib64/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
    124         data, then fit the transformed data using the final estimator.
    125         """
--> 126         Xt, fit_params = self._pre_transform(X, y, **fit_params)
    127         self.steps[-1][-1].fit(Xt, y, **fit_params)
    128         return self

/usr/lib64/python2.7/site-packages/sklearn/pipeline.pyc in _pre_transform(self, X, y, **fit_params)
    114         for name, transform in self.steps[:-1]:
    115             if hasattr(transform, "fit_transform"):
--> 116                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    117             else:
    118                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/usr/lib64/python2.7/site-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
    362         else:
    363             # fit method of arity 2 (supervised transformation)
--> 364             return self.fit(X, y, **fit_params).transform(X)
    365
    366

/usr/lib64/python2.7/site-packages/sklearn/svm/base.pyc in fit(self, X, y)
    684             raise ValueError("X and y have incompatible shapes.\n"
    685                              "X has %s samples, but y has %s." %
--> 686                              (X.shape[0], y.shape[0]))
    687
    688         liblinear.set_verbosity_wrap(self.verbose)

ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 124795.
  • I believe the scikit-learn API is usually *.fit(X, y), where X is the N x d array with N observations and d features. So you want to swap your x and y. You should redefine them to be consistent with everyone else rather than calling clf.fit(y, x). Commented Apr 15, 2014 at 23:21
  • Yes, this looks like it is the problem. Let us know if that fixes it. Commented Apr 16, 2014 at 8:29
  • So dumb! Thanks so much guys! I'm happy it was that easy and not something much harder. @larsmans, I didn't need to transpose anything Commented Apr 16, 2014 at 14:20

1 Answer

Most people use X for the features and y for the labels. Unfortunately, you have them the other way around, which is probably why the documentation was confusing.
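To see where the numbers in the error message might come from, here is a small sketch using only NumPy. It assumes (I have not checked the old scikit-learn source) that the 1-D array passed as X was promoted to a single 2-D row and the 2-D array passed as y was flattened; the zero-filled arrays are made-up stand-ins with the shapes from the question.

```python
import numpy as np

n_samples, n_features = 11345, 11  # shapes reported in the question

# Made-up stand-ins for df[cols].values and df['Miss'].values.
features = np.zeros((n_samples, n_features))
labels = np.zeros(n_samples)

# Swapped roles, as in the question: the 1-D label array is passed as X.
as_X = np.atleast_2d(labels)  # a 1-D array promoted to 2-D becomes one row
as_y = features.ravel()       # a 2-D array flattened to 1-D

print(as_X.shape[0], as_y.shape[0])  # 1 124795
```

Under that assumption, 11345 * 11 = 124795 matches the "y has 124795" in the traceback, and the single row matches "X has 1 samples".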

Use the following instead:

In [519]: y = df['Miss'].values

In [520]: X = df[list(cols)].values

Then you can fit the model with clf.fit(X, y).
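For completeness, a minimal end-to-end sketch of the same pipeline with X as the 2-D feature array and y as the 1-D label array. The random data, the 200-row size, and the n_estimators/random_state settings are my own placeholders, not taken from the question. On recent scikit-learn releases, as far as I know, LinearSVC can no longer act as a transformer by itself, so the sketch wraps it in SelectFromModel; on the version in the question that wrapper was not needed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder data standing in for df[cols].values and df['Miss'].values.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 11))           # (n_samples, n_features)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # (n_samples,)

clf = Pipeline([
    # L1-penalized LinearSVC drives some coefficients to zero;
    # SelectFromModel keeps the features with non-zero weights.
    ('feature_selection',
     SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification',
     RandomForestClassifier(n_estimators=50, random_state=0)),
])

clf.fit(X, y)  # features first, labels second
pred = clf.predict(X)
print(pred.shape)  # (200,)
```

The key check before calling fit is that X.shape[0] == y.shape[0], i.e. one label per row of features.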
