# Discovering time-series cross-validation

**Time-series** data differs from the data we have worked with so far. "Normal" data is assumed to be **independent and identically distributed** (**IID**), but time-series data does not follow that assumption. Each sample depends on previous samples, so changing the order of the samples changes how the data is interpreted.

Several examples of time-series data are listed as follows:

- Daily stock market price
- Hourly temperature data
- Minute-by-minute web page click counts

Applying the previous cross-validation strategies (for example, k-fold, random, or stratified splits) to time-series data introduces **look-ahead bias**. Look-ahead bias occurs when we use future values of the data that would not yet be available at the current time of the simulation.

For instance, suppose we are working with hourly temperature data. We want to predict what the temperature will be in 2 hours, but we use the temperature value of the next hour or the next 3 hours, which is not available yet. This kind of bias arises easily when we apply the previous cross-validation strategies, since those strategies are designed to work well only on IID data.
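To make the bias concrete, here is a minimal sketch that counts how many folds of a shuffled k-fold split train on data from the future. The array `hours` and the counter `leaky_folds` are illustrative names, and the twelve integer "readings" are an assumption standing in for real hourly timestamps.

```python
import numpy as np
from sklearn.model_selection import KFold

# Twelve hourly readings; the index doubles as the timestamp (hour 0..11).
hours = np.arange(12)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)

leaky_folds = 0
for train_index, val_index in kfold.split(hours):
    # A fold leaks if any training hour comes after the earliest validation hour.
    if train_index.max() > val_index.min():
        leaky_folds += 1

print(f"{leaky_folds} of {kfold.get_n_splits()} folds train on future data")
```

Only a fold whose validation set happens to be the last block of hours can avoid the leak, so most folds of a k-fold split will train on observations that come after the ones they are evaluated on.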

**Time-series cross-validation** is the cross-validation strategy that is specifically designed to handle time-series data. It works similarly to k-fold in that it accepts a predefined number of folds, k, and generates k test sets. The difference is that the *data is not shuffled in the first place*, and the training set in each iteration is a *superset* of the one in the previous iteration, meaning the training set keeps growing as the iterations proceed. Once we finish with the cross-validation and get the final model configuration, we can then test our final model on the test data (see *Figure 1.4*):

Figure 1.4 – Time-series cross-validation

Also, the Scikit-Learn package provides us with a nice implementation of this strategy:

```python
from sklearn.model_selection import train_test_split, TimeSeriesSplit

# Hold out the last 20% of rows as the final test set;
# shuffle=False keeps the rows in their original (time) order.
df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0, shuffle=False)

tscv = TimeSeriesSplit(n_splits=5)
for train_index, val_index in tscv.split(df_cv):
    df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
    # perform training or hyperparameter tuning here
```

Providing `n_splits=5` ensures that five test sets are generated. It is worth noting that, by default, the train set will have a size of `i * (n_samples // (n_splits + 1)) + n_samples % (n_splits + 1)` for the ith fold, while the test set will have a size of `n_samples // (n_splits + 1)`.
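The default sizes can be checked empirically. The following sketch compares the sizes produced by `TimeSeriesSplit` against the default rule (test size `n_samples // (n_splits + 1)`, with the remainder going to the train set). The sample count of 23 is an arbitrary assumption, chosen so that the remainder term is non-zero.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples, n_splits = 23, 5  # arbitrary illustrative values
X = np.arange(n_samples)

tscv = TimeSeriesSplit(n_splits=n_splits)
sizes = [(len(train), len(val)) for train, val in tscv.split(X)]

# Default rule: every validation set has the same size, and the ith
# train set (i = 1..n_splits) holds i validation-sized blocks plus the remainder.
val_size = n_samples // (n_splits + 1)
remainder = n_samples % (n_splits + 1)
expected = [(i * val_size + remainder, val_size) for i in range(1, n_splits + 1)]

print(sizes)
print(sizes == expected)
```

Note that each train set ends exactly where its validation set begins, which is why the train size grows by one validation-sized block per fold.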

However, you can change the train and test set sizes via the `max_train_size` and `test_size` arguments of the `TimeSeriesSplit` class. Additionally, there is a `gap` argument that excludes G samples from the end of each train set, where G is a value specified by the developer.
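As a rough illustration of these arguments, the sketch below splits twenty integer samples with a capped training window, a fixed validation size, and a gap of one sample. The specific values are assumptions picked for readability, and the `test_size` and `gap` arguments require a reasonably recent Scikit-Learn release (0.24 or later).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20)  # the values double as time indices

tscv = TimeSeriesSplit(n_splits=3, max_train_size=6, test_size=2, gap=1)
splits = [(train.tolist(), val.tolist()) for train, val in tscv.split(X)]

for train, val in splits:
    print("train:", train, "val:", val)
```

With `max_train_size` set, the training window stops growing and instead slides forward, and with `gap=1` the sample immediately before each validation set is left out of the corresponding train set.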

Be aware that the Scikit-Learn implementation always ensures that *there is no overlap between test sets*, even though avoiding such overlap is not actually necessary. Currently, there is no way to enable overlap between the test sets using the Scikit-Learn implementation. To use that kind of strategy, you need to *write the code from scratch*.
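One possible way to write such a splitter from scratch is sketched below. The function `overlapping_time_series_split` and its parameters (`test_size`, `step`, `min_train_size`) are hypothetical names introduced for this illustration, not part of Scikit-Learn. The sketch keeps the expanding training window and lets validation windows overlap whenever `step` is smaller than `test_size`.

```python
import numpy as np

def overlapping_time_series_split(n_samples, test_size, step, min_train_size):
    """Yield (train_index, val_index) pairs with an expanding training window.

    Validation windows overlap whenever step < test_size.
    """
    if min(n_samples, test_size, step, min_train_size) < 1:
        raise ValueError("all arguments must be positive integers")
    indices = np.arange(n_samples)
    val_start = min_train_size
    while val_start + test_size <= n_samples:
        # Train on everything strictly before the validation window.
        yield indices[:val_start], indices[val_start:val_start + test_size]
        val_start += step

splits = list(
    overlapping_time_series_split(n_samples=10, test_size=4, step=2, min_train_size=4)
)
for train_index, val_index in splits:
    print("train:", train_index.tolist(), "val:", val_index.tolist())
```

Because every training set ends before its validation window starts, the splitter still avoids look-ahead bias even though consecutive validation windows share samples.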

In this section, we learned about the unique characteristic of time-series data and how to perform a cross-validation strategy on it. There are other variations of the cross-validation strategy that haven't been covered in this book. If you are interested, you might find some pointers in the *Further reading* section.