# Discovering time-series cross-validation

Time-series data differs from "normal" data in one important way. Normal data is assumed to be independent and identically distributed (IID), but time-series data does not follow that assumption. Each sample depends on the samples before it, so changing the order of the samples changes how the data is interpreted.

Several examples of time-series data are listed as follows:

• Daily stock market price
• Hourly temperature data
• Minute-by-minute web page clicks count

If we apply the previous cross-validation strategies (for example, k-fold, random, or stratified splits) to time-series data, we introduce look-ahead bias. Look-ahead bias occurs when we use future values of the data that would not yet be available at the current time of the simulation.

For instance, suppose we are working with hourly temperature data. We want to predict what the temperature will be in 2 hours, but we use the temperature value of the next hour or of the next 3 hours, which would not be available yet. This kind of bias arises easily with the previous cross-validation strategies, since those strategies are designed to work well only on IID data.
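To make the problem concrete, the following sketch runs a shuffled k-fold split over twelve time-ordered positions and counts the folds in which some training index is later than the earliest validation index. The toy series length and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Twelve time-ordered observations (index 0 is the oldest); toy data only.
hours = np.arange(12)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
leaky_folds = 0
for train_index, val_index in kfold.split(hours):
    # Look-ahead: the training set contains observations that occur
    # after the earliest observation we are asked to predict.
    if train_index.max() > val_index.min():
        leaky_folds += 1
    print("train:", train_index, "val:", val_index)

print("folds with look-ahead:", leaky_folds)
```

A fold avoids look-ahead only if its validation set happens to be the final block of the series, so with three folds at least two of them train on "future" observations.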

Time-series cross-validation is a cross-validation strategy designed specifically for time-series data. Like k-fold, it accepts a predefined number of folds, k, and generates k test sets. The difference is that the data is never shuffled, and the training set in each iteration is a superset of the one in the previous iteration, so the training set keeps growing as the iterations proceed (see Figure 1.4). Once we finish the cross-validation and settle on the final model configuration, we can test the final model on the test data:

Figure 1.4 – Time-series cross-validation

Also, the Scikit-Learn package provides us with a nice implementation of this strategy:

```python
from sklearn.model_selection import train_test_split, TimeSeriesSplit

# Hold out the last 20% as the final test set; shuffle=False keeps time order
df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0, shuffle=False)

tscv = TimeSeriesSplit(n_splits=5)
for train_index, val_index in tscv.split(df_cv):
    # Each training set is an expanding window that precedes its validation set
    df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
    # perform training or hyperparameter tuning here
```

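Setting `n_splits=5` generates five test sets. It is worth noting that, by default, the train set has a size of `i * (n_samples // (n_splits + 1)) + n_samples % (n_splits + 1)` for the ith fold, while the test set has a size of `n_samples // (n_splits + 1)`, where `n_samples` is the number of samples, `//` is integer division, and `%` is the remainder.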
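A quick way to check the default sizes is to run the splitter on placeholder data and record the length of each index array. The sketch below assumes 80 samples and compares the observed lengths with the default sizes given in the Scikit-Learn documentation; the array `X` and the sample count are illustrative only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples, n_splits = 80, 5
X = np.arange(n_samples).reshape(-1, 1)  # placeholder feature matrix

# Default sizes as described in the Scikit-Learn documentation
fold_size = n_samples // (n_splits + 1)
remainder = n_samples % (n_splits + 1)

tscv = TimeSeriesSplit(n_splits=n_splits)
sizes = []
for i, (train_index, val_index) in enumerate(tscv.split(X), start=1):
    expected_train = i * fold_size + remainder
    sizes.append((len(train_index), len(val_index)))
    print(f"fold {i}: train={len(train_index)} (expected {expected_train}), "
          f"val={len(val_index)} (expected {fold_size})")
```

Each training set grows by exactly one validation block per fold, which is what makes it a superset of the previous one.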

However, you can change the train and test set sizes via the `max_train_size` and `test_size` arguments of the `TimeSeriesSplit` class. Additionally, there is a `gap` argument that excludes G samples from the end of each train set, where G is a value you specify.
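The effect of these three arguments is easiest to see on a small index range. The following sketch assumes a Scikit-Learn version that supports `test_size` and `gap` (0.24 or later) and uses 30 placeholder samples; the specific values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(30).reshape(-1, 1)  # 30 time-ordered placeholder samples

# Cap each train set at 10 samples, use 4 samples per validation set,
# and drop the 2 samples that sit between each train and validation set.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=10, test_size=4, gap=2)

splits = [(train.tolist(), val.tolist()) for train, val in tscv.split(X)]
for train, val in splits:
    print(f"train {train[0]}-{train[-1]} ({len(train)} samples), "
          f"val {val[0]}-{val[-1]} ({len(val)} samples)")
```

With `max_train_size` set, the training window slides forward instead of expanding, and `gap` leaves a buffer that can be useful when features are built from lagged values.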

Be aware that the Scikit-Learn implementation always ensures that the test sets do not overlap, even though this is not strictly necessary. Currently, the Scikit-Learn implementation offers no way to allow overlapping test sets, so you need to write the code from scratch to use that kind of strategy.
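One possible from-scratch approach is sketched below. The generator `overlapping_time_series_split` and its parameters are hypothetical and are not part of Scikit-Learn; it keeps an expanding training window and advances the validation window by `step` samples, so consecutive validation sets overlap whenever `step` is smaller than `test_size`.

```python
def overlapping_time_series_split(n_samples, min_train_size, test_size, step):
    """Yield (train_indices, val_indices) with an expanding training window.

    Hypothetical helper, not part of Scikit-Learn. Validation windows
    overlap whenever step < test_size.
    """
    if min(min_train_size, test_size, step) < 1:
        raise ValueError("min_train_size, test_size, and step must be positive")
    val_start = min_train_size
    while val_start + test_size <= n_samples:
        # Train on everything strictly before the validation window
        train_indices = list(range(val_start))
        val_indices = list(range(val_start, val_start + test_size))
        yield train_indices, val_indices
        val_start += step


folds = list(
    overlapping_time_series_split(n_samples=12, min_train_size=4, test_size=4, step=2)
)
for train_indices, val_indices in folds:
    print("train:", train_indices, "val:", val_indices)
```

The yielded index lists can be passed to `iloc` in the same way as the indices returned by `TimeSeriesSplit`, for example `df_cv.iloc[train_indices]`.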

In this section, we learned about the unique characteristic of time-series data and how to perform a cross-validation strategy on it. There are other variations of the cross-validation strategy that haven't been covered in this book. If you are interested, you might find some pointers in the Further reading section.