Discovering repeated k-fold cross-validation
Repeated k-fold cross-validation simply performs k-fold cross-validation repeatedly, N times, with a different randomization in each repetition. The final evaluation score is the average of the scores from all folds across all repetitions. This strategy increases our confidence in the model's estimated performance.
So, why repeat the k-fold cross-validation instead of just increasing the value of k? It is true that increasing k reduces the bias of our model's estimated performance. However, it also increases the variance of that estimate, especially when we have a small number of samples, because each validation fold becomes smaller. Therefore, repeating the k-fold procedure is usually a better way to gain higher confidence in our model's estimated performance. Of course, this comes with a drawback: the increase in computation time.
To implement this strategy, we could write a manual for loop and apply the k-fold cross-validation strategy in each iteration. Fortunately, the Scikit-Learn package provides a dedicated class that implements this strategy for us:
from sklearn.model_selection import train_test_split, RepeatedKFold

df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0)

rkf = RepeatedKFold(n_splits=4, n_repeats=3, random_state=0)
for train_index, val_index in rkf.split(df_cv):
    df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
    # perform training or hyperparameter tuning here
Choosing n_splits=4 and n_repeats=3 means that we will have 12 different train and validation sets. The final evaluation score is then just the average of all 12 scores. As you might expect, there is also a dedicated class that implements the repeated k-fold in a stratified fashion:
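To make the averaging concrete, here is a minimal sketch using Scikit-Learn's cross_val_score, which accepts a RepeatedKFold splitter directly via its cv parameter. The synthetic dataset and the logistic regression model are illustrative choices, not from the text:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Illustrative data and model; substitute your own
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

rkf = RepeatedKFold(n_splits=4, n_repeats=3, random_state=0)

# cross_val_score runs all 4 x 3 = 12 train/validation splits
scores = cross_val_score(model, X, y, cv=rkf)

print(len(scores))  # 12
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean is a common way to convey how stable the estimate is across the 12 splits.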
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold

df_cv, df_test = train_test_split(df, test_size=0.2, random_state=0, stratify=df['class'])

rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=3, random_state=0)
for train_index, val_index in rskf.split(df_cv, df_cv['class']):
    df_train, df_val = df_cv.iloc[train_index], df_cv.iloc[val_index]
    # perform training or hyperparameter tuning here
The RepeatedStratifiedKFold class performs stratified k-fold cross-validation repeatedly, n_repeats times.
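The following sketch checks that stratification actually holds across every repetition. The imbalanced labels here are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Illustrative imbalanced labels: 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # feature values are irrelevant to the splitting itself

rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=3, random_state=0)
for train_index, val_index in rskf.split(X, y):
    # every validation fold keeps the original 80/20 class ratio
    ratio = np.bincount(y[val_index]) / len(val_index)
    print(ratio)  # [0.8 0.2] in each of the 12 folds
```

This is why the stratified variant is preferred for classification problems with imbalanced classes: no fold ends up with a distorted class distribution by chance.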
Now that you have learned another variation of the cross-validation strategy, called repeated k-fold cross-validation, let's learn about the other variations next.