
I have a basic linear regression with 80 numerical variables (no categorical variables). The training set has 1600 rows, the test set 700.

I would like a Python package that iterates through all column combinations to find the best score under a custom score function or an out-of-the-box score function like AIC. Or, if that doesn't exist, what do people here use for variable selection? I know R has some packages for this, but I don't want to deal with rpy2.

I have no preference whether the solution requires scikit-learn, numpy, pandas, statsmodels, or something else.
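For reference, an exhaustive search over all combinations of 80 columns is infeasible (2^80 subsets), but greedy forward selection scored by AIC is a common approximation. A minimal sketch using statsmodels (the `forward_select_aic` helper and its column names are illustrative, not from any package):

import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select_aic(X, y):
    """Greedy forward selection: repeatedly add the column that most
    improves (lowers) AIC, stopping when no column helps."""
    remaining = list(X.columns)
    selected = []
    best_aic = np.inf
    while remaining:
        # Score every candidate model that adds one more column.
        scores = []
        for col in remaining:
            cols = selected + [col]
            model = sm.OLS(y, sm.add_constant(X[cols])).fit()
            scores.append((model.aic, col))
        aic, col = min(scores)
        if aic >= best_aic:
            break  # no single addition improves AIC
        best_aic = aic
        selected.append(col)
        remaining.remove(col)
    return selected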


1 Answer


I can suggest using the Least Absolute Shrinkage and Selection Operator (Lasso). I haven't used it in a situation like yours, with that many variables to deal with.

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

I often write code to do linear regression with statsmodels like this:

import statsmodels.api as sm

# OLS takes the data at construction time; fit() takes no data arguments
model = sm.OLS(train_Y, sm.add_constant(train_X))
results = model.fit()

If I want to do Lasso regression, I write:

from sklearn import linear_model

model = linear_model.Lasso(alpha=1.0)  # 1.0 is the default
results = model.fit(train_X, train_Y)

You have to choose an appropriate alpha. It can be any non-negative value (not just between 0.0 and 1.0): larger values apply stronger regularization and drive more coefficients exactly to zero, which is what performs the variable selection, while smaller values keep more variables in the model. Cross-validation is the usual way to pick it.
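If picking alpha by hand is awkward, scikit-learn's `LassoCV` chooses it by cross-validation. A minimal sketch on synthetic data (the data shapes here are illustrative, not the asker's):

import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: y depends only on the first two of ten columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# LassoCV searches a grid of alphas with 5-fold cross-validation.
model = LassoCV(cv=5).fit(X, y)
kept = np.flatnonzero(model.coef_)  # columns not shrunk to zero

The indices in `kept` are the variables Lasso retained; `model.alpha_` holds the chosen penalty.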

Try this.


1 Comment

This may theoretically answer the question, but it would be best to include the essential parts of the answer here for future users, and provide the link for reference. Link-dominated answers can become invalid through link rot.
