Getting scikit-learn to work with pandas

Question

I'm just getting started in ML and need some help getting sklearn to work with pandas.

http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection-as-part-of-a-pipeline

I was reading this and decided to try it out with a DataFrame I had. Below is what I did and the error that came from it. I'm pretty new to all of this, so excuse me if I'm overlooking something dumb, but I thought it would be better to ask here than to hack away at an answer without really understanding it.

Thanks guys!

In [518]: cols = ['A','B','C','D','E','F','G','H','I','J','K']

In [519]: x = df['Miss'].values

In [520]: y = df[list(cols)].values

In [532]: y.shape
Out[532]: (11345, 11)

In [533]: x.shape
Out[533]: (11345,)

clf = Pipeline([
  ('feature_selection', LinearSVC(penalty="l1", dual=False)),
  ('classification', RandomForestClassifier())])

In [536]: clf.fit(x,y)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/cschwalbach/as_research_repo/logs/<ipython-input-536-5c1831092d7a> in <module>()
----> 1 clf.fit(x,y)

/usr/lib64/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
    124         data, then fit the transformed data using the final estimator.
    125         """
--> 126         Xt, fit_params = self._pre_transform(X, y, **fit_params)
    127         self.steps[-1][-1].fit(Xt, y, **fit_params)
    128         return self

/usr/lib64/python2.7/site-packages/sklearn/pipeline.pyc in _pre_transform(self, X, y, **fit_params)
    114         for name, transform in self.steps[:-1]:
    115             if hasattr(transform, "fit_transform"):
--> 116                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    117             else:
    118                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/usr/lib64/python2.7/site-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
    362         else:
    363             # fit method of arity 2 (supervised transformation)
--> 364             return self.fit(X, y, **fit_params).transform(X)
    365
    366

/usr/lib64/python2.7/site-packages/sklearn/svm/base.pyc in fit(self, X, y)
    684             raise ValueError("X and y have incompatible shapes.\n"
    685                              "X has %s samples, but y has %s." %
--> 686                              (X.shape[0], y.shape[0]))
    687
    688         liblinear.set_verbosity_wrap(self.verbose)

ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 124795.

I believe the scikit-learn API is usually *.fit(X, y), where X is the N x d array with N observations and d features. So you want to swap your x and y. Rather than calling clf.fit(y, x), redefine them so the names are consistent with everyone else's usage.
– TomAugspurger, Commented Apr 15, 2014 at 23:21
Yes, this looks like it is the problem. Let us know if that fixes it.
– eickenberg, Commented Apr 16, 2014 at 8:29
So dumb! Thanks so much guys! I'm happy it was that easy and not something much harder. @larsmans, I didn't need to transpose anything.
– user1610719, Commented Apr 16, 2014 at 14:20

waitingkuo · Accepted Answer · 2014-04-16 09:15:42Z

3

Most people use X for the features and y for the labels. Unfortunately, you have them the other way around, which is probably why the documentation was confusing.

Use the following instead:

In [519]: y = df['Miss'].values

In [520]: X = df[list(cols)].values

Then you can fit the model by clf.fit(X, y)
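
For reference, here is a minimal, self-contained sketch of the corrected setup. The DataFrame is a hypothetical stand-in filled with random data, since the asker's real df is not shown. Wrapping LinearSVC in SelectFromModel is an assumption about your scikit-learn version: recent releases need it to use an estimator as a feature-selection step, while the version in the question accepted LinearSVC directly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the asker's DataFrame: 11 feature columns
# plus a binary 'Miss' label column, all generated from random data.
rng = np.random.RandomState(0)
cols = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K']
df = pd.DataFrame(rng.randn(200, len(cols)), columns=cols)
noise = 0.5 * rng.randn(200)
df['Miss'] = (df['A'] + df['B'] + noise > 0).astype(int)

X = df[cols].values    # features, shape (n_samples, n_features)
y = df['Miss'].values  # labels, shape (n_samples,)

clf = Pipeline([
    # L1-penalized LinearSVC drives some coefficients to zero;
    # SelectFromModel keeps only the features with nonzero weight.
    ('feature_selection',
     SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', RandomForestClassifier(random_state=0)),
])

clf.fit(X, y)  # features first, labels second
predictions = clf.predict(X)
```

With the arguments in this order, X.shape[0] and y.shape[0] agree, which is exactly the check that raised the ValueError in the question.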

