
I am trying to do feature selection using the scikit-learn library. My data is simple: rows are samples and columns are features. Though the original class labels are X and Y, I changed them to numeric values for linear regression: X to 0 and Y to 1.

G1  G2  G3  ... Gn Class
1.0 4.0 5.0 ... 1.0 0
4.0 5.0 9.0 ... 1.0 0
9.0 6.0 3.0 ... 2.0 1
...

I used sklearn.linear_model.LinearRegression(), and it performed well. Now I am using the coef_ values for feature selection. I have 2 questions about this.

Is it right to use the coef_ values of the features? Or are there other, better parameters for feature selection with LinearRegression()?

In addition, is there some kind of rule for deciding a proper threshold (for example, a minimum coef_ value for feature selection)?

  • You may want to google lasso :) Commented Dec 10, 2015 at 13:44
  • @cel As far as I know, lasso is a model using an L1 penalty. However, even if I use lasso, the same problem of deciding a proper threshold remains. I could select the top-N ranked features, but I want to find some borderline (a certain value) for the decision. Commented Dec 10, 2015 at 14:01
  • The advantage of the L1 penalty is that it prefers 0-valued coefficients. It performs feature selection for you by setting the coefficients of unimportant features to 0. You just need to set the regularization parameter high enough until you are satisfied with the feature-count vs. accuracy trade-off. Then you won't need any threshold, since the coefficient is already 0. Commented Dec 10, 2015 at 14:11
  • @Robin Spiess Thank you for your advice. In fact, I used Lasso before. However, as the number of features in the dataset is large (about 50,000), there are still many features with coef_ > 0, and sometimes a 'ConvergenceWarning' even occurred, so I wanted to find some tighter threshold. Commented Dec 10, 2015 at 14:27
  • I don't know of any formal thresholding rules. What I would do is try multiple thresholds and do cross-validation with the resulting feature set. Basically: set all feature coefficients < thresh to 0, then retrain the model using only the features that still have a non-zero coefficient on a subset of your data, and test the resulting model on the rest of your data. Adjust the threshold until you're happy with the results. Commented Dec 10, 2015 at 15:24

1 Answer


Simply deciding based on coefficient values is plainly illogical, because unless your data is normalized, the value of a coefficient does not indicate anything by itself.

For example: suppose one feature ranges over (0, 1) and its coefficient is 0.5, while another ranges over (0, 10000) and its coefficient is also 0.5. Clearly the latter feature has much more weight in generating the final output, due to its bigger range.

So what is generally suggested is to normalize the features, i.e. $ x' = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)} $, and then decide based on the coefficient values.

Note: to make predictions, remember to apply the same transformation to the features.
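A minimal sketch of this standardize-then-compare approach, using a `Pipeline` so the same transformation is automatically applied at prediction time (the data and coefficients below are made up for illustration):

```python
# Standardize features before comparing coefficients. After scaling, the
# coefficients are on a common scale and can be compared directly.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
# Three features on very different scales; only the first two matter.
X = rng.randn(200, 3) * np.array([1.0, 100.0, 10.0])
y = 0.5 * X[:, 0] + 0.005 * X[:, 1] + 0.01 * rng.randn(200)

model = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
coefs = model.named_steps["linearregression"].coef_
print(coefs)  # first two coefficients are now comparable in magnitude
```

On the raw data, the two informative features would have coefficients 0.5 and 0.005; after standardization both come out near 0.5, which is exactly the point the answer makes.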

This might not always work, as normalizing may distort the features. There are other heuristics too, which you can read about elsewhere.

Another way is by elimination: eliminate features one by one and see how important they are. This can be done by checking the p-value in the case of regression, or simply the error of the fit (sum of squares).

A suggestion: it seems you are using linear regression for a classification problem, which is again wrong in principle, as linear regression assumes the output y is continuous, whereas here y is 0 or 1. You might want to use logistic regression instead.
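The suggested switch could look like the sketch below, which also uses an L1 penalty so that feature selection happens directly on the 0/1 labels, tying back to the lasso discussion in the comments. The data and the value of `C` are illustrative assumptions.

```python
# Sketch: logistic regression for 0/1 class labels, with an L1 penalty
# (solver='liblinear' supports it) so unimportant coefficients become 0.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # binary labels, like the 0/1 here

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
nonzero = np.flatnonzero(clf.coef_[0])
print("nonzero coefficients:", nonzero)
```

Smaller `C` means stronger regularization, so lowering it shrinks more coefficients to exactly zero, just as raising `alpha` does for Lasso.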


3 Comments

Thank you very much! Because I had not thought of normalization, I was struggling. Your advice is extremely helpful and I learned a lot from you. I'd like to ask one more question if you don't mind. May I use Lasso or Ridge regression for feature selection, even though the labels are 0 and 1? I thought they might be handled like binary values, but that seems to be wrong.
It depends on what you want. Lasso does not heavily penalize large weight values, so you may get some zeros and some very large values. Ridge regression penalizes large weights, so you will get many weights that are close to zero. So it kind of depends on trial and error and on what data you have. If you expect very few non-zero weights, try lasso.
Thank you again for your explanation.
