
I am using sklearn modules to find the best-fitting models and model parameters. However, I get an unexpected IndexError, shown below:

> IndexError                                Traceback (most recent call
> last) <ipython-input-38-ea3f99e30226> in <module>
>      22             s = mean_squared_error(y[ts], best_m.predict(X[ts]))
>      23             cv[i].append(s)
> ---> 24     print(np.mean(cv, 1))
> IndexError: tuple index out of range

What I want to do is find the best-fitting regressor and its parameters, but I get the above error instead. I looked into SO and tried this solution, but the same error still comes up. Any idea how to fix this bug? Can anyone point out why this error is happening?

My code:

import warnings

import numpy as np

from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from xgboost.sklearn import XGBRegressor

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso(), XGBRegressor()]
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]

X, y = make_regression(n_samples=10000, n_features=20)

with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    cv = [[] for _ in range(len(models))]
    fold = KFold(5,shuffle=False)
    for tr, ts in fold.split(X):
        for i, (model, param) in enumerate(zip(models, params)):
            best_m = GridSearchCV(model, param)
            best_m.fit(X[tr], y[tr])
            s = mean_squared_error(y[ts], best_m.predict(X[ts]))
            cv[i].append(s)
    print(np.mean(cv, 1))

desired output:

If there is a way to fix the above error, I expect to pick the best-fitted models with their parameters and then use them for estimation. Any idea how to improve the above attempt? Thanks.

3 Comments
  • @desertnaut How do you think I could optimize this code? Any better ideas? Commented Jul 16, 2019 at 16:47
  • That's a very general question, but doing a grid search in each one of 5 folds sounds like overkill. I kindly suggest you open another question asking for advice on this (be sure to make your code fully reproducible, including all relevant imports). Commented Jul 16, 2019 at 16:56
  • The error can be reproduced with np.mean([], 1), which supports the idea that cv is [] or contains [] lists. Commented Jul 16, 2019 at 17:59

2 Answers


When you define

cv = [[] for _ in range(len(models))]

it has an empty list for each model. In the loop, however, you iterate over enumerate(zip(models, params)), which has only two elements, since your params list has two elements (because list(zip(x, y)) has length equal to min(len(x), len(y))).

Hence, you get an IndexError because some of the lists in cv are empty (all but the first two) when you calculate the mean with np.mean.
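For illustration, here is a minimal sketch of that truncation (the strings are just placeholders standing in for the actual estimators):

# zip() stops at the shorter of its inputs, so only the first two
# (model, param) pairs are ever produced; the last four models are
# silently skipped and their lists in cv stay empty.
models_demo = ["SVR", "RandomForest", "Linear", "Ridge", "Lasso", "XGB"]
params_demo = [{"C": [0.01, 1]}, {"n_estimators": [10, 20]}]

print(list(zip(models_demo, params_demo)))
# [('SVR', {'C': [0.01, 1]}), ('RandomForest', {'n_estimators': [10, 20]})]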

Solution: if you don't need to use GridSearchCV on the remaining models, you can simply extend the params list with empty dictionaries:

params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}, {}]
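
This works because GridSearchCV treats an empty parameter grid as a single candidate: the estimator with its default parameters. A minimal sketch (using LinearRegression purely as an example):

# With an empty param grid, GridSearchCV evaluates exactly one candidate:
# the estimator with its defaults, so the "search" degenerates to a plain
# cross-validated fit.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
gs = GridSearchCV(LinearRegression(), {}, cv=3)
gs.fit(X, y)
print(gs.best_params_)  # {} - only the default candidate was tried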

3 Comments

I don't think this is the answer to this question. Please read the SO community rules.
@Dan Since you haven't posted an MWE I can't verify with certainty that this is the solution, but it works with your code after importing the appropriate modules, and it matches the output you gave in the comments for cv (see the last edit for the specific change you would have to make to params).
This is indeed the correct answer (upvoted); I can't understand the downvotes. I proceed to explain in more detail...

The root cause of your issue is that, while you ask for the evaluation of 6 models in GridSearchCV, you provide parameters only for the first 2:

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso(), XGBRegressor()]
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]

The result of enumerate(zip(models, params)) in this setting, i.e.:

for i, (model, param) in enumerate(zip(models, params)):
    print((model, param))

is

(SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False), {'C': [0.01, 1]})
(RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False), {'n_estimators': [10, 20]})

i.e. the last 4 models are simply ignored, so you get empty entries for them in cv:

print(cv)
# result:
[[5950.6018771284835, 5987.293514740653, 6055.368320208183, 6099.316091619069, 6146.478702335218], [3625.3243553665975, 3301.3552182952058, 3404.3321983193728, 3521.5160621260898, 3561.254684271113], [], [], [], []]

which causes the downstream error when trying to get the np.mean(cv, 1).
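
You can reproduce the failure in isolation: with a ragged list NumPy cannot build a 2-D array, so axis 1 simply does not exist (depending on the NumPy version this surfaces as an IndexError, as in the traceback above, or as a ValueError):

import numpy as np

# cv looks like this after the loop: some sub-lists are empty, so NumPy
# cannot stack it into a 2-D array and axis=1 is out of range.
ragged = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [], []]
try:
    print(np.mean(ragged, 1))
except (IndexError, ValueError) as e:
    print(type(e).__name__, e)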

The solution, as already correctly pointed out by Psi in their answer, is to use empty dictionaries for the models on which you don't actually perform any CV search; omitting the XGBRegressor (which I have not installed), here are the results:

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso()]
params2 = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}]

cv = [[] for _ in range(len(models))]
fold = KFold(5,shuffle=False)
for tr, ts in fold.split(X):
    for i, (model, param) in enumerate(zip(models, params2)):
        best_m = GridSearchCV(model, param)
        best_m.fit(X[tr], y[tr])
        s = mean_squared_error(y[ts], best_m.predict(X[ts]))
        cv[i].append(s)

where print(cv) gives:

[[4048.660483326826, 3973.984055352062, 3847.7215568088545, 3907.0566348092684, 3820.0517432992765], [1037.9378737329769, 1025.237441119364, 1016.549294695313, 993.7083268195154, 963.8115632611381], [2.2948917095935095e-26, 1.971022007799432e-26, 4.1583774042712844e-26, 2.0229469068846665e-25, 1.9295075684919642e-26], [0.0003350178681602639, 0.0003297411022124562, 0.00030834076832371557, 0.0003355298330301431, 0.00032049282437794516], [10.372789356303688, 10.137748082073076, 10.136028304131141, 10.499159069700834, 9.80779910439471]]

and print(np.mean(cv, 1)) works OK, giving:

[3.91949489e+03 1.00744890e+03 6.11665355e-26 3.25824479e-04
 1.01907048e+01]

So, in your case, you should indeed change params to:

params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}, {}]

as already suggested by Psi.
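
If the goal, as stated in the question, is then to pick the best-fitted model and use it for estimation, one possible follow-up is sketched below; this is my own suggestion (not part of either answer) and it assumes X, y, models, params and cv from the corrected code above:

# Sketch (assumes X, y, models, params, cv defined as above): pick the
# model with the lowest mean CV error, re-run its grid search on the full
# data, and keep the refitted best estimator for predictions.
mean_scores = np.mean(cv, 1)
best_i = int(np.argmin(mean_scores))               # MSE: lower is better
final_search = GridSearchCV(models[best_i], params[best_i])
final_search.fit(X, y)

print(type(models[best_i]).__name__, final_search.best_params_)
y_pred = final_search.predict(X)                   # estimates from the chosen model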
