
As part of the Enron project, I built the model shown below. A summary of the steps follows after the code.

The model below gives suspiciously perfect scores:

from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV

# pipe, clf_params, features and labels are defined earlier in the script
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)

gcv.fit(features, labels)  # fit on the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    gcv.best_estimator_.predict(x_test)

The model below gives more reasonable, but lower, scores:

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)

gcv.fit(features, labels)  # fit on the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # refit the best estimator on the training fold before predicting
    gcv.best_estimator_.fit(x_train, y_train)
    gcv.best_estimator_.predict(x_test)

Summary of the steps:

  1. Used SelectKBest to score the features, sorted them, and tried combinations of higher- and lower-scoring features.

  2. Used an SVM with GridSearch, cross-validated with a StratifiedShuffleSplit.

  3. Used the best_estimator_ to predict and calculate the precision and recall (a runnable sketch of these steps follows this list).
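For context, here is a minimal, self-contained sketch of that kind of setup. The toy data, the k values and the C grid are placeholders of mine, not the actual Enron features or the parameter grid used in the project:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-ins for the Enron features/labels
features = np.random.rand(200, 10)
labels = np.random.randint(0, 2, size=200)

# Step 1: score the features with SelectKBest and inspect the ranking
selector = SelectKBest(f_classif, k='all').fit(features, labels)
ranking = sorted(enumerate(selector.scores_), key=lambda t: t[1], reverse=True)
print(ranking)  # (feature index, score) pairs, highest score first

# Steps 2-3: SVM in a pipeline, tuned by GridSearchCV over a StratifiedShuffleSplit
pipe = Pipeline([('scale', StandardScaler()),
                 ('kbest', SelectKBest(f_classif)),
                 ('svm', SVC())])
clf_params = {'kbest__k': [3, 5, 8], 'svm__C': [1, 10, 100]}

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)

# Evaluate the best parameters on held-out folds (the second snippet above)
for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
    gcv.best_estimator_.fit(x_train, y_train)
    pred = gcv.best_estimator_.predict(x_test)
    print(precision_score(y_test, pred), recall_score(y_test, pred))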

The problem is that the estimator is spitting out perfect scores, in some cases 1.0.

But when I refit the best classifier on the training data and then run the test, it gives reasonable scores.

My question is: what exactly does GridSearch do with the test data after the split with the ShuffleSplit object we pass to it? I assumed it would not fit anything on the test data. If that were true, then when I predict using that same test data, it should not give such high scores, right? Since I used a random_state value, the ShuffleSplit should have created the same splits for the GridSearch fit and for my prediction loop.

So, is using the same ShuffleSplit for both wrong?
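(For reference, a StratifiedShuffleSplit built with a fixed random_state does hand back identical splits on every call to split(); a quick sketch with hypothetical toy arrays, not the Enron data:)

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)   # toy features
y = np.array([0, 1] * 5)           # toy labels, two balanced classes

cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
first = [test for _, test in cv.split(X, y)]
second = [test for _, test in cv.split(X, y)]

# Fixed random_state => identical test folds on every call to split()
print(all(np.array_equal(a, b) for a, b in zip(first, second)))  # True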

2 Answers


As @Gauthier Feuillen said, GridSearchCV is used to search for the best parameters of an estimator on the given data. A description of what GridSearchCV does:

  1. gcv = GridSearchCV(pipe, clf_params, cv=cv)
  2. gcv.fit(features, labels)
  3. clf_params will be expanded into all possible separate combinations using ParameterGrid.
  4. features will now be split into features_train and features_test using cv. Same for labels.
  5. Now the GridSearch estimator (pipe) will be trained using features_train and labels_train and scored using features_test and labels_test.
  6. For each possible combination of parameters from step 3, steps 4 and 5 will be repeated for the cv iterations. The average score across the cv iterations will be calculated and assigned to that parameter combination. This can be accessed via the cv_results_ attribute of the GridSearch.
  7. For the parameters that give the best score, the internal estimator will be re-initialized with those parameters and refit on the whole data supplied to it (features and labels).

Because of the last step, you are getting different scores with the first and second approaches. In the first approach, all of the data was used for training and you are predicting on that same data; the second approach predicts on previously unseen data.
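For example, the cross-validated score of the winning combination can be read straight from the fitted search object instead of re-scoring best_estimator_ by hand; this sketch reuses the gcv, features and labels names from the question:

# Mean score of every parameter combination across the 100 shuffle splits
for params, mean_score in zip(gcv.cv_results_['params'],
                              gcv.cv_results_['mean_test_score']):
    print(params, round(mean_score, 3))

print(gcv.best_params_)  # combination with the highest mean_test_score
print(gcv.best_score_)   # its mean cross-validated score

# best_estimator_ was refit on ALL of (features, labels) because refit=True
# by default, so scoring it on any split of that same data is optimistic
print(gcv.best_estimator_.score(features, labels))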


5 Comments

Thanks for the detailed explanation, really appreciate it. Just two small things: I think you meant that steps 4 & 5 will be repeated in step 6, and step 7 is done only when refit=True in the GridSearch object, right?
Yes, step 7 is performed when refit=True, which is the default in GridSearchCV(). Since you haven't specified the refit parameter in your code, I did not call it out explicitly.
It's a great and very detailed answer!
@VivekKumar +1 for the nice explanation. In the second approach, he fits the training data once again to the final model (found by best_estimator_). Is this step needed?
@Md.SabbirAhmed With the refit param in GridSearchCV, the best_estimator_ is initially trained on the whole data supplied to GridSearchCV.fit(). For cross-validation that is not what we want. So in his second approach he is just measuring the performance of the best found parameters on the cross-validation folds by training the best_estimator_ again on the train data of each fold. That is correct, but it is not needed, because GridSearchCV already calculates the scores for each fold for each parameter combination.
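(Sketch of that last point: the per-fold evaluation in the question's second approach is essentially what cross_val_score would compute for the best pipeline, reusing the gcv, features, labels and cv names from the question:)

from sklearn.base import clone
from sklearn.model_selection import cross_val_score

# A fresh, unfitted copy of the best pipeline, scored on the same shuffle splits
scores = cross_val_score(clone(gcv.best_estimator_), features, labels, cv=cv)
print(scores.mean())  # comparable to gcv.best_score_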

Basically the grid search will:

  • Try every combination of your parameter grid (the ParameterGrid sketch below makes this explicit)
  • For each of them, run a cross-validation (here, your StratifiedShuffleSplit)
  • Select the best combination it finds.

So your second case is the right one. Otherwise you are actually predicting on data that you trained with, which is not the case in the second option: there you only keep the best parameters from your grid search.
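For the first bullet, "every combination" is the Cartesian product of the grid, which sklearn exposes as ParameterGrid; the grid below is a toy example, not the question's clf_params:

from sklearn.model_selection import ParameterGrid

grid = {'svm__C': [1, 10], 'svm__gamma': [0.01, 0.1]}
for params in ParameterGrid(grid):
    print(params)
# 4 combinations in total (2 x 2); GridSearchCV cross-validates each one
# and keeps the combination with the best mean score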

1 Comment

Thanks for the clarification, it makes sense now. I thought the grid search just ran tests on the test data and did not use it for training at all. Really appreciate the response.
