The validation dataset is not used in training. There are feature matrices and y vectors for both training and validation datasets. The model is fitted with training data, and it is used to predict the y values of the validation feature matrix. Then, the predicted y values (probabilities) are compared with the actual y values.
Multiple comparisons problem (MCP): just by chance one model can be lucky and obtain good predictions because all of them are probabilistic.
The test set can help to avoid the MCP. Obtention of the best model is done with the training and validation datasets, while the test dataset is used for assuring that the proposed best model is the best.
- Split datasets in training, validation, and test.
- Train the models
- Evaluate the models
- Select the best model
- Apply the best model to the test dataset
- Compare the performance metrics of validation and test
The notes are written by the community. If you see an error here, please create a PR with a fix. |