Discovering time-series cross-validation
Time-series data has a distinctive characteristic. Unlike "normal" data, which is assumed to be independent and identically distributed (IID), time-series data violates that assumption. Each sample depends on previous samples, so changing the order of the samples changes how the data is interpreted.
Several examples of time-series data are listed as follows:
- Daily stock market price
- Hourly temperature data
- Minute-by-minute web page click counts
Applying the previous cross-validation strategies (for example, k-fold, random, or stratified splits) to time-series data introduces look-ahead bias. Look-ahead bias occurs when we use future values of the data that would not be available at the current time of the simulation.
For instance, suppose we are working with hourly temperature data. We want to predict the temperature 2 hours from now, but we use the temperature of the next hour or the next 3 hours, which would not be available yet. This kind of bias arises easily with the previous cross-validation strategies, since they are designed to work well only on IID data.
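To make the bias concrete, the following minimal sketch applies plain k-fold to twelve hypothetical hourly observations and counts the folds in which the training set contains samples that come later than the earliest validation sample. The hours array and the leaky_folds counter are illustrative names that do not come from the original example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Twelve hypothetical hourly observations, indexed in time order.
hours = np.arange(12)

kf = KFold(n_splits=3)  # plain k-fold, no shuffling
leaky_folds = 0
for train_index, val_index in kf.split(hours):
    # A fold looks ahead if any training sample is later in time
    # than the earliest validation sample.
    if train_index.max() > val_index.min():
        leaky_folds += 1
    print("train:", train_index, "val:", val_index)

print("folds with look-ahead:", leaky_folds)
```

Only the fold whose validation block sits at the very end of the series is free of look-ahead. Shuffling the data before splitting does not help, because at most one fold can have all of its validation samples after all of its training samples.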
Time-series cross-validation is a cross-validation strategy designed specifically for time-series data. Like k-fold, it accepts a predefined number of folds, k, and generates k test sets. The difference is that the data is not shuffled, and the training set in each iteration is a superset of the one in the previous iteration, so the training set keeps growing as the iterations proceed. Once we finish the cross-validation and settle on the final model configuration, we can test the final model on the test data (see Figure 1.4):
Figure 1.4 – Time-series cross-validation
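The expanding training window can also be inspected directly from the indices. The sketch below is a small illustration on twelve hypothetical, time-ordered samples (the samples array is an assumption for demonstration only). It checks that each training set contains the previous one and that every training sample precedes its validation samples.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

samples = np.arange(12)  # hypothetical data, already sorted by time
tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(samples))

for train_index, val_index in splits:
    print("train:", train_index, "val:", val_index)
    # No look-ahead: training always ends before validation starts.
    assert train_index.max() < val_index.min()

# Each training set is a superset of the previous one.
for (prev_train, _), (next_train, _) in zip(splits, splits[1:]):
    assert set(prev_train) <= set(next_train)
```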
Also, the Scikit-Learn package provides us with a nice implementation of this strategy:
from sklearn.model_selection import train_test_split, TimeSeriesSplit

# Hold out the last 20% as test data; shuffle=False preserves the time order
df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0, shuffle=False)
tscv = TimeSeriesSplit(n_splits=5)
for train_index, val_index in tscv.split(df_cv):
    df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
    # perform training or hyperparameter tuning here
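Providing n_splits=5 ensures that five test sets are generated (in the preceding code, they serve as validation sets). It is worth noting that, by default, the train set has a size of i * (n_samples // (n_splits + 1)) + n_samples % (n_splits + 1) for the ith fold, while the test set has a size of n_samples // (n_splits + 1).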
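The default sizes can be checked numerically. With the defaults, the ith train set should contain i * (n_samples // (n_splits + 1)) + n_samples % (n_splits + 1) samples, and each test set should contain n_samples // (n_splits + 1) samples. The sketch below compares those expressions with what TimeSeriesSplit actually returns. The value n_samples = 23 is an arbitrary choice that leaves a non-zero remainder.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples, n_splits = 23, 5  # arbitrary illustrative values
tscv = TimeSeriesSplit(n_splits=n_splits)

# Observed (train size, test size) for each split.
sizes = [(len(train), len(test)) for train, test in tscv.split(np.zeros(n_samples))]

# Sizes predicted by the default-size expressions, for i = 1..n_splits.
fold_size = n_samples // (n_splits + 1)
remainder = n_samples % (n_splits + 1)
expected = [(i * fold_size + remainder, fold_size) for i in range(1, n_splits + 1)]

print(sizes)
print(expected)
```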
However, you can change the train and test set sizes via the max_train_size and test_size arguments of the TimeSeriesSplit class. Additionally, there is a gap argument that excludes G samples from the end of each train set, where G is a value specified by the developer.
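As a rough illustration of these three arguments together, the sketch below caps each train set at four samples, fixes each test set at two samples, and leaves a gap of one sample between them. The concrete values and the twelve-sample array are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

samples = np.arange(12)  # hypothetical data, already sorted by time
tscv = TimeSeriesSplit(n_splits=3, max_train_size=4, test_size=2, gap=1)

splits = list(tscv.split(samples))
for train_index, val_index in splits:
    # With gap=1, one sample between the train and test sets is dropped.
    print("train:", train_index, "val:", val_index)
```

Because max_train_size is set, the training window slides forward instead of growing, which can be useful when older observations are no longer representative.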
Be aware that the Scikit-Learn implementation always ensures that the test sets do not overlap, although this is not strictly necessary. Currently, the Scikit-Learn implementation offers no way to allow overlapping test sets. To use that kind of strategy, you need to write the code from scratch.
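One possible from-scratch sketch of a splitter with overlapping test sets is shown below. The function overlapping_time_series_split and its parameters (initial_train_size, val_size, step) are hypothetical and not part of Scikit-Learn. It keeps the expanding training window and lets consecutive validation windows overlap whenever step is smaller than val_size.

```python
import numpy as np

def overlapping_time_series_split(n_samples, initial_train_size, val_size, step):
    """Yield (train_index, val_index) pairs with an expanding training window.

    Hypothetical helper, not a Scikit-Learn API. Validation windows overlap
    whenever step < val_size.
    """
    if min(initial_train_size, val_size, step) < 1:
        raise ValueError("initial_train_size, val_size, and step must be positive")
    indices = np.arange(n_samples)
    val_start = initial_train_size
    # Stop once a full validation window no longer fits in the data.
    while val_start + val_size <= n_samples:
        yield indices[:val_start], indices[val_start:val_start + val_size]
        val_start += step

splits = list(
    overlapping_time_series_split(n_samples=10, initial_train_size=4, val_size=3, step=1)
)
for train_index, val_index in splits:
    print("train:", train_index, "val:", val_index)
```

The generator yields index arrays in the same (train, validation) form as TimeSeriesSplit.split, so the resulting indices can be used with iloc in the same way as in the earlier loop.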
In this section, we learned about the unique characteristic of time-series data and how to perform a cross-validation strategy on it. There are other variations of the cross-validation strategy that haven't been covered in this book. If you are interested, you might find some pointers in the Further reading section.