1.5 Model Selection Process

Notes

The validation dataset is held out and not used for training. Both the training and the validation datasets have a feature matrix (X) and a target vector (y). The model is fitted on the training data and then used to predict the y values for the validation feature matrix. The predicted y values (probabilities) are then compared with the actual y values to measure how well the model performs.
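As a minimal sketch of this hold-out evaluation, the snippet below fits a tiny logistic regression on training data only and then scores it on the validation data. The synthetic data, the helper names (`make_data`, `fit_logistic`, `predict_proba`), and the 0.5 decision threshold are all assumptions made for illustration; the notes do not prescribe a particular model or metric.

```python
import math
import random

random.seed(1)

def make_data(n):
    # Hypothetical synthetic data: one feature x; y = 1 is more likely when x is large.
    X, y = [], []
    for _ in range(n):
        x = random.uniform(-3, 3)
        p = 1 / (1 + math.exp(-2 * x))
        X.append(x)
        y.append(1 if random.random() < p else 0)
    return X, y

X_train, y_train = make_data(400)
X_val, y_val = make_data(100)

def fit_logistic(X, y, lr=0.1, epochs=200):
    # Plain batch gradient descent on the logistic loss for a single feature.
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for xi, yi in zip(X, y):
            p = 1 / (1 + math.exp(-(w * xi + b)))
            grad_w += (p - yi) * xi
            grad_b += p - yi
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, X):
    w, b = model
    return [1 / (1 + math.exp(-(w * xi + b))) for xi in X]

# Fit on the training data only; the validation data is never seen during fitting.
model = fit_logistic(X_train, y_train)

# Predict probabilities for the validation feature matrix.
y_pred = predict_proba(model, X_val)

# Compare predictions with the actual validation labels (accuracy at a 0.5 threshold).
val_accuracy = sum((p >= 0.5) == (t == 1) for p, t in zip(y_pred, y_val)) / len(y_val)
print(f"validation accuracy: {val_accuracy:.3f}")
```

Accuracy is used here only because it is easy to compute by hand; any other metric could be compared in the same way.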

Multiple comparisons problem (MCP): when many models are compared on the same validation dataset, one of them can get lucky and obtain good predictions just by chance, because all of the models are probabilistic.
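The effect can be illustrated with a small simulation. In this sketch every "model" is a coin flip with no real predictive power, yet the best of many such models still looks good on a small validation set. The sizes (`n_val`, `n_test`, `n_models`) and the coin-flip models are assumptions chosen to make the effect visible, not part of the course material.

```python
import random

random.seed(0)

n_val, n_test, n_models = 50, 10_000, 200

# Random binary labels: there is nothing to learn here.
y_val = [random.randint(0, 1) for _ in range(n_val)]
y_test = [random.randint(0, 1) for _ in range(n_test)]

def accuracy(preds, y):
    return sum(p == t for p, t in zip(preds, y)) / len(y)

# Each "model" guesses at random, so its true accuracy is 50%.
val_scores = []
for _ in range(n_models):
    preds = [random.randint(0, 1) for _ in range(n_val)]
    val_scores.append(accuracy(preds, y_val))

# The best validation score is inflated purely by chance.
best_val = max(val_scores)

# On fresh data the "winning" model is still a coin flip.
test_preds = [random.randint(0, 1) for _ in range(n_test)]
test_score = accuracy(test_preds, y_test)

print(f"best validation accuracy: {best_val:.2f}")
print(f"accuracy on fresh data:   {test_score:.2f}")
```

The gap between the two printed numbers is the selection effect: picking the maximum over many noisy scores overstates how good the chosen model is.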

The test set helps guard against the MCP. The best model is selected using the training and validation datasets, while the test dataset is used to check that the selected model really performs well on data it has not seen.

1. Split the dataset into training, validation, and test sets.
2. Train the models.
3. Evaluate the models on the validation dataset.
4. Select the best model.
5. Apply the best model to the test dataset.
6. Compare the performance metrics on the validation and test datasets.
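The six steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the course's implementation: the synthetic dataset, the 60/20/20 split, the use of k-nearest-neighbours candidates with k in `[1, 5, 15]`, and the helper names `knn_predict` and `accuracy` are all hypothetical choices.

```python
import random

random.seed(42)

# Hypothetical synthetic dataset: one feature and a noisy binary label.
data = []
for _ in range(500):
    x = random.uniform(0, 10)
    label = 1 if x + random.gauss(0, 1.5) > 5 else 0
    data.append((x, label))

# 1. Split the dataset into training, validation, and test sets (60/20/20).
random.shuffle(data)
train, val, test = data[:300], data[300:400], data[400:]

def knn_predict(train_rows, x, k):
    # Majority vote among the k training rows closest to x (k is odd, so no ties).
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train_rows, rows, k):
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in rows)
    return hits / len(rows)

# 2. "Train" the models: k-NN only stores the training rows, one candidate per k.
candidates = [1, 5, 15]

# 3. Evaluate each candidate on the validation set.
val_scores = {k: accuracy(train, val, k) for k in candidates}

# 4. Select the model with the best validation score.
best_k = max(val_scores, key=val_scores.get)

# 5. Apply the selected model to the test set.
test_score = accuracy(train, test, best_k)

# 6. Compare the validation and test metrics.
print(f"best k: {best_k}")
print(f"validation accuracy: {val_scores[best_k]:.3f}")
print(f"test accuracy:       {test_score:.3f}")
```

If the test score is much lower than the validation score, that suggests the selected model may have won on the validation set partly by luck.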

⚠️	The notes are written by the community. If you see an error here, please create a PR with a fix.

Navigation

Machine Learning Zoomcamp course
Lesson 1: Introduction to Machine Learning
Previous: CRISP-DM
Next: Setting up the Environment
