# Creating training, validation, and test sets

We understand that overfitting can be detected by monitoring the model's performance on the training data versus the unseen data, but what exactly is unseen data? Is it just random data that has not yet been seen by the model during the training phase?

Unseen data is a portion of our original complete data that was not seen by the model during the training phase. We usually refer to this unseen data as the test set. Let's imagine you have 100,000 samples of data to begin with; you can set aside a portion of the data, let's say 10% of it, to become the test set. So, now we have 90,000 samples as the training set and 10,000 samples as the test set.

However, it is better to split our original data not just into train and test sets but also into a validation set, especially when we want to perform hyperparameter tuning on our model. Let's say that out of 100,000 original samples, we hold out 10% to become the validation set and another 10% to become the test set. Therefore, we will have 80,000 samples as the train set, 10,000 samples as the validation set, and 10,000 samples as the test set.
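The 80,000 / 10,000 / 10,000 split described above can be sketched in plain Python. This is a minimal illustration that uses only the standard library; the function name `train_val_test_split` and its parameters are hypothetical and chosen for this example. In practice, a library utility such as scikit-learn's `train_test_split` is often applied twice to get the same result.

```python
import random

def train_val_test_split(samples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle samples and split them into train, validation, and test lists."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(samples)
    # A seeded generator makes the shuffle (and so the split) reproducible.
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_frac)
    n_test = round(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_val - n_test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

data = list(range(100_000))  # stand-in for 100,000 samples
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 80000 10000 10000
```

Because the three lists are slices of one shuffled copy, no sample can end up in more than one set.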

You might be wondering why we need a validation set in addition to the test set. Actually, we do not need one if we are not going to perform hyperparameter tuning or any other model-centric approaches. The purpose of the validation set is to keep the evaluation on the test set unbiased: every tuning decision is made against the validation set, so the test set is used only to evaluate the final version of the trained model.

A validation set helps us get an unbiased evaluation on the test set because only the validation set, and never the test set, is used during the hyperparameter tuning phase. Once we finish the hyperparameter tuning phase and get the final model configuration, we can then evaluate our model on the purely unseen data, which is the test set.
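To make that workflow concrete, here is a small sketch with a toy one-dimensional k-nearest-neighbors classifier written from scratch. The data, the candidate values of `k`, and the helper names `knn_predict` and `accuracy` are all assumptions made for illustration. The point is only the order of operations: candidates are compared on the validation set, and the test set is scored once at the end.

```python
import random

def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if votes * 2 > k else 0

def accuracy(train, eval_set, k):
    """Fraction of eval_set points classified correctly using the train set."""
    correct = sum(knn_predict(train, x, k) == y for x, y in eval_set)
    return correct / len(eval_set)

rng = random.Random(0)
# Toy data: the label is 1 when x > 0.5, with about 10% of labels flipped as noise.
points = []
for _ in range(500):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    points.append((x, y))

# The points were generated in random order, so slicing is a random split.
train, val, test = points[:400], points[400:450], points[450:]

# Hyperparameter tuning uses only the train and validation sets.
candidates = [1, 5, 15, 31]  # odd values of k avoid tied votes
val_scores = {k: accuracy(train, val, k) for k in candidates}
best_k = max(val_scores, key=val_scores.get)

# The test set is touched exactly once, after the configuration is final.
test_score = accuracy(train, test, best_k)
print(best_k, val_scores[best_k], test_score)
```

If the test set had been used to pick `best_k`, the reported test score would be biased toward the chosen configuration, which is exactly what the validation set is there to prevent.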

Important Note

If you are going to perform any data preprocessing steps (for example, missing value imputation, feature engineering, standardization, label encoding, and more), you have to fit each step on the train set only and then apply the fitted step to the validation and test sets. Do not perform those data preprocessing steps on the full original data (before data splitting). That's because it can lead to a data leakage problem, where information from the validation and test sets influences what the model sees during training.
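As a minimal sketch of the note above, the following example standardizes a single feature. The statistics are computed from the train set only and then reused for the validation and test sets. The helper names `fit_standardizer` and `apply_standardizer` and the tiny data lists are hypothetical; scikit-learn's `StandardScaler` follows the same pattern with `fit` on the train set and `transform` on the others.

```python
import statistics

def fit_standardizer(values):
    """Learn the mean and standard deviation from the given values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return mean, std

def apply_standardizer(values, mean, std):
    """Scale values with previously learned statistics."""
    return [(v - mean) / std for v in values]

# Toy feature values, already split into three sets.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
val = [6.0, 7.0]
test = [8.0, 9.0]

# Fit on the train set only ...
mean, std = fit_standardizer(train)

# ... then apply the same fitted statistics to every set.
train_scaled = apply_standardizer(train, mean, std)
val_scaled = apply_standardizer(val, mean, std)
test_scaled = apply_standardizer(test, mean, std)
```

Computing `mean` and `std` from `train + val + test` instead would let the held-out values shape the training features, which is the leakage the note warns about.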

There is no specific rule when it comes to choosing the proportions for each of the train, validation, and test sets. You have to choose the split proportions yourself based on the situation you are facing. However, a common split in the data science community is 8:2 or 9:1 between the train set and the combined validation and test sets. Usually, the validation and test sets share the held-out portion equally, in a proportion of 1:1. Therefore, the common splitting proportion is 8:1:1 or 9:0.5:0.5 for the train, validation, and test sets, respectively.
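As a quick check of the arithmetic, the sketch below turns a ratio such as 8:1:1 into concrete set sizes. The function name `split_sizes` is hypothetical, and giving any rounding remainder to the train set is a design choice made for this example rather than a rule.

```python
def split_sizes(n_samples, ratio):
    """Convert a train:validation:test ratio into integer set sizes."""
    total = sum(ratio)
    sizes = [int(n_samples * part / total) for part in ratio]
    # Give any samples lost to rounding down to the train set.
    sizes[0] += n_samples - sum(sizes)
    return sizes

print(split_sizes(100_000, (8, 1, 1)))      # [80000, 10000, 10000]
print(split_sizes(100_000, (9, 0.5, 0.5)))  # [90000, 5000, 5000]
```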

Now that we are aware of the train, validation, and test set concept, we need to learn how to build those sets. Do we just randomly split our original data into three sets? Or can we also apply some predefined rules? In the next section, we will explore this topic in more detail.