Training, Testing and Validation Datasets in Machine Learning

What they are and when to use them in machine learning

December 24, 2022 · 3 mins read


By the end of this article, we will be able to answer these questions:

  1. What are the training, testing, and validation datasets?
  2. When do we use each of them in machine learning?

The goal of machine learning is to train a model that generalizes.
For a classification problem, a general model should correctly classify unseen data, that is, data it was never trained on.

The key is to split the dataset into a training part and a testing part, because the trained model must be tested on data it has not seen.

If we train and test on the same data, it is as if the practice exam (training) and the real exam (testing) contained the same questions.
The model can then score well simply by memorizing the answers, so the score tells us nothing about how well it will classify new data (we have no evidence of a general model at all).
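
The exam analogy can be made concrete with a small sketch. It is not taken from any library: `Memorizer`, `make_data`, and `accuracy` are hypothetical names invented for illustration, and the synthetic data (points in [0, 1) labelled by whether they exceed 0.5) is an assumption. The "model" only memorizes its training examples, so it scores perfectly on data it has seen and close to chance on data it has not.

```python
import random


def make_data(n, rng):
    """Return n pairs (x, label), where label is 1 if x > 0.5 and 0 otherwise."""
    return [(x, int(x > 0.5)) for x in (rng.random() for _ in range(n))]


class Memorizer:
    """A deliberately bad 'model': it stores the training pairs and
    answers 0 for any input it has not seen before."""

    def fit(self, data):
        self.table = dict(data)

    def predict(self, x):
        return self.table.get(x, 0)


def accuracy(model, data):
    """Fraction of examples in `data` that the model labels correctly."""
    return sum(model.predict(x) == y for x, y in data) / len(data)


rng = random.Random(0)
seen = make_data(200, rng)    # data the model is trained on
unseen = make_data(200, rng)  # fresh data from the same source

model = Memorizer()
model.fit(seen)

acc_seen = accuracy(model, seen)      # 1.0: every answer was memorized
acc_unseen = accuracy(model, unseen)  # about the share of label-0 points

print("accuracy on seen data:  ", acc_seen)
print("accuracy on unseen data:", acc_unseen)
```

Testing on `seen` would report a perfect model; only the score on `unseen` reveals that nothing general was learned.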

This is why we split the dataset into two parts:

  1. Training dataset (for training a model)
  2. Testing dataset (for testing the trained model)

Typically, 80% of the dataset is used for training and 20% for testing (the percentages can vary):

[Figure: the dataset split into a training set and a testing set]
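
An 80/20 split like the one described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper name `train_test_split_simple`, the default `test_fraction=0.2`, and the toy dataset of 100 integers are all assumptions made for the example. Libraries such as scikit-learn offer a ready-made `train_test_split` that serves the same purpose.

```python
import random


def train_test_split_simple(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and cut it into (train, test).

    The shuffle removes any ordering in the original data, and the
    fixed seed makes the split reproducible.
    """
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_fraction)
    return items[n_test:], items[:n_test]


dataset = list(range(100))  # stand-in for 100 labelled examples
train, test = train_test_split_simple(dataset)

print(len(train), len(test))  # 80 20
```

Every example lands in exactly one of the two parts, so the model is never tested on something it was trained on.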

Then what is a validation dataset, and when do we need it?

A validation dataset is used to validate and tune the model before we test the final trained model.
When tuning the model, we adjust its hyperparameters in order to find the best model to evaluate.

We get the validation dataset by splitting the whole dataset as follows:

[Figure: the dataset split into training, validation, and testing sets]

When the dataset is large enough, we typically use 60% of the data for training, 20% for validation, and 20% for testing.
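
The 60/20/20 split can be sketched the same way as a two-way split, with one extra cut. The function name `split_three_ways`, its default fractions, and the toy dataset are assumptions chosen for illustration only.

```python
import random


def split_three_ways(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and cut it into (train, val, test)."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_fraction)
    n_val = round(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]  # whatever remains is used for training
    return train, val, test


dataset = list(range(100))  # stand-in for 100 labelled examples
train, val, test = split_three_ways(dataset)

print(len(train), len(val), len(test))  # 60 20 20
```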

The most important point about the validation dataset is that the model occasionally “sees” this data while we fine-tune it, but never “learns” from it: the model is fitted only on the training data.
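
The difference between "seeing" and "learning" can be illustrated with a rough sketch. Everything in it is an assumption made for the example: the synthetic noisy data, the tiny k-nearest-neighbours classifier (`knn_predict`), and the candidate values of the hyperparameter `k`. The model is fitted on the training set only, the validation set is used only to pick `k`, and the test set is touched once at the very end.

```python
import random


def make_noisy_data(n, rng, noise=0.2):
    """x in [0, 1); the label is int(x > 0.5), flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k is odd)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in nearest)
    return int(votes * 2 > k)


def accuracy(train, data, k):
    """Accuracy on `data` of a k-NN model that only knows `train`."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


rng = random.Random(0)
data = make_noisy_data(500, rng)  # already in random order
train, val, test = data[:300], data[300:400], data[400:]  # 60/20/20

# Tuning: the model "sees" the validation set only through these scores.
candidate_ks = [1, 3, 5, 9, 15]
val_scores = {k: accuracy(train, val, k) for k in candidate_ks}
best_k = max(val_scores, key=val_scores.get)

# Final evaluation: the test set is used once, after k is fixed.
test_score = accuracy(train, test, best_k)

print("validation accuracy per k:", val_scores)
print("chosen k:", best_k)
print("test accuracy:", test_score)
```

Because `k` was chosen to look good on the validation set, the validation score is slightly optimistic. That is why a separate test set is kept for the final, unbiased estimate.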

In the next post, we will learn:

  1. How to split the dataset when there is not enough data ("K-Fold Cross Validation")
  2. Why we use it for model selection

