
I am using sklearn modules to find the best-fitting models and model parameters. However, I get an unexpected IndexError, shown below:

> IndexError                                Traceback (most recent call
> last) <ipython-input-38-ea3f99e30226> in <module>
>      22             s = mean_squared_error(y[ts], best_m.predict(X[ts]))
>      23             cv[i].append(s)
> ---> 24     print(np.mean(cv, 1))
> IndexError: tuple index out of range

What I want to do is find the best-fitting regressor and its parameters, but I get the above error instead. I looked into SO and tried this solution, but the same error still comes up. Any idea how to fix this bug? Can anyone point out why this error is happening?

My code:

import warnings

import numpy as np

from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from xgboost.sklearn import XGBRegressor

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso(), XGBRegressor()]
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]

X, y = make_regression(n_samples=10000, n_features=20)

with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    cv = [[] for _ in range(len(models))]
    fold = KFold(5,shuffle=False)
    for tr, ts in fold.split(X):
        for i, (model, param) in enumerate(zip(models, params)):
            best_m = GridSearchCV(model, param)
            best_m.fit(X[tr], y[tr])
            s = mean_squared_error(y[ts], best_m.predict(X[ts]))
            cv[i].append(s)
    print(np.mean(cv, 1))

desired output:

If there is a way to fix the above error, I expect to pick the best-fitted models with their parameters and then use them for estimation. Any idea how to improve the above attempt? Thanks.

3 Comments
  • @desertnaut How do you think I could optimize this code? Any better ideas? Commented Jul 16, 2019 at 16:47
  • That's a very general question, but doing a grid search in each one of 5 folds sounds like overkill. I kindly suggest you open another question asking for advice on this (be sure to make your code fully reproducible, including all relevant imports). Commented Jul 16, 2019 at 16:56
  • The error can be reproduced with np.mean([], 1), which supports the idea that cv is [] or contains [] lists. Commented Jul 16, 2019 at 17:59

2 Answers


When you define

cv = [[] for _ in range(len(models))]

it has an empty list for each model. In the loop, however, you iterate over enumerate(zip(models, params)), which has only two elements, since your params list has two elements (because list(zip(x, y)) has length equal to min(len(x), len(y))).

Hence, you get an IndexError because some of the lists in cv are empty (all but the first two) when you calculate the mean with np.mean.
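For illustration, here is a minimal sketch of that truncation (the strings are just placeholders standing in for the actual estimators):

# zip() stops at the shorter of its inputs, so only the first two
# (model, param) pairs are ever produced; the last four models are
# silently skipped and their lists in cv stay empty.
models_demo = ["SVR", "RandomForest", "Linear", "Ridge", "Lasso", "XGB"]
params_demo = [{"C": [0.01, 1]}, {"n_estimators": [10, 20]}]

print(list(zip(models_demo, params_demo)))
# [('SVR', {'C': [0.01, 1]}), ('RandomForest', {'n_estimators': [10, 20]})]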

Solution: if you don't need to use GridSearchCV on the remaining models, you can simply extend the params list with empty dictionaries:

params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}, {}]
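
This works because GridSearchCV treats an empty parameter grid as a single candidate: the estimator with its default parameters. A minimal sketch (using LinearRegression purely as an example):

# With an empty param grid, GridSearchCV evaluates exactly one candidate:
# the estimator with its defaults, so the "search" degenerates to a plain
# cross-validated fit.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
gs = GridSearchCV(LinearRegression(), {}, cv=3)
gs.fit(X, y)
print(gs.best_params_)  # {} - only the default candidate was tried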

3 Comments

I don't think this is the answer to this question. Please read the SO community rules.
@Dan Since you haven't posted an MWE I can't verify with certainty that this is the solution, but it works with your code after importing the appropriate modules, and it matches the output you gave in the comments for cv (see the last edit for the specific change you would have to make to params).
This is indeed the correct answer (upvoted); I can't understand the downvotes. I proceed to explain in more detail...

The root cause of your issue is that, while you ask for the evaluation of 6 models in GridSearchCV, you provide parameters only for the first 2:

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso(), XGBRegressor()]
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]

The result of enumerate(zip(models, params)) in this setting, i.e.:

for i, (model, param) in enumerate(zip(models, params)):
    print((model, param))

is

(SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False), {'C': [0.01, 1]})
(RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False), {'n_estimators': [10, 20]})

i.e. the last 4 models are simply ignored, so you get empty entries for them in cv:

print(cv)
# result:
[[5950.6018771284835, 5987.293514740653, 6055.368320208183, 6099.316091619069, 6146.478702335218], [3625.3243553665975, 3301.3552182952058, 3404.3321983193728, 3521.5160621260898, 3561.254684271113], [], [], [], []]

which causes the downstream error when trying to get the np.mean(cv, 1).
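
You can reproduce the failure in isolation: with a ragged list NumPy cannot build a 2-D array, so axis 1 simply does not exist (depending on the NumPy version this surfaces as an IndexError, as in the traceback above, or as a ValueError):

import numpy as np

# cv looks like this after the loop: some sub-lists are empty, so NumPy
# cannot stack it into a 2-D array and axis=1 is out of range.
ragged = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [], []]
try:
    print(np.mean(ragged, 1))
except (IndexError, ValueError) as e:
    print(type(e).__name__, e)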

The solution, as already correctly pointed out by Psi in their answer, is to use empty dictionaries for the models on which you don't actually perform any CV search; omitting the XGBRegressor (which I have not installed), here are the results:

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso()]
params2 = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}]

cv = [[] for _ in range(len(models))]
fold = KFold(5,shuffle=False)
for tr, ts in fold.split(X):
    for i, (model, param) in enumerate(zip(models, params2)):
        best_m = GridSearchCV(model, param)
        best_m.fit(X[tr], y[tr])
        s = mean_squared_error(y[ts], best_m.predict(X[ts]))
        cv[i].append(s)

where print(cv) gives:

[[4048.660483326826, 3973.984055352062, 3847.7215568088545, 3907.0566348092684, 3820.0517432992765], [1037.9378737329769, 1025.237441119364, 1016.549294695313, 993.7083268195154, 963.8115632611381], [2.2948917095935095e-26, 1.971022007799432e-26, 4.1583774042712844e-26, 2.0229469068846665e-25, 1.9295075684919642e-26], [0.0003350178681602639, 0.0003297411022124562, 0.00030834076832371557, 0.0003355298330301431, 0.00032049282437794516], [10.372789356303688, 10.137748082073076, 10.136028304131141, 10.499159069700834, 9.80779910439471]]

and print(np.mean(cv, 1)) works OK, giving:

[3.91949489e+03 1.00744890e+03 6.11665355e-26 3.25824479e-04
 1.01907048e+01]

So, in your case, you should indeed change params to:

params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}, {}]

as already suggested by Psi.
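
If the goal, as stated in the question, is then to pick the best-fitted model and use it for estimation, one possible follow-up is sketched below; this is my own suggestion (not part of either answer) and it assumes X, y, models, params and cv from the corrected code above:

# Sketch (assumes X, y, models, params, cv defined as above): pick the
# model with the lowest mean CV error, re-run its grid search on the full
# data, and keep the refitted best estimator for predictions.
mean_scores = np.mean(cv, 1)
best_i = int(np.argmin(mean_scores))               # MSE: lower is better
final_search = GridSearchCV(models[best_i], params[best_i])
final_search.fit(X, y)

print(type(models[best_i]).__name__, final_search.best_params_)
y_pred = final_search.predict(X)                   # estimates from the chosen model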
