Book Image

Practical Machine Learning with R

By : Brindha Priyadarshini Jeyaraman, Ludvig Renbo Olsen, Monicah Wambugu
Book Image

Practical Machine Learning with R

By: Brindha Priyadarshini Jeyaraman, Ludvig Renbo Olsen, Monicah Wambugu

Overview of this book

With huge amounts of data being generated every moment, businesses need applications that apply complex mathematical calculations to data repeatedly and at speed. With machine learning techniques and R, you can easily develop these kinds of applications in an efficient way. Practical Machine Learning with R begins by helping you grasp the basics of machine learning methods, while also highlighting how and why they work. You will understand how to get these algorithms to work in practice, rather than focusing on mathematical derivations. As you progress from one chapter to another, you will gain hands-on experience of building a machine learning solution in R. Next, using R packages such as rpart, random forest, and multiple imputation by chained equations (MICE), you will learn to implement algorithms including neural net classifier, decision trees, and linear and non-linear regression. As you progress through the book, you’ll delve into various machine learning techniques for both supervised and unsupervised learning approaches. In addition to this, you’ll gain insights into partitioning the datasets and mechanisms to evaluate the results from each model and be able to compare them. By the end of this book, you will have gained expertise in solving your business problems, starting by forming a good problem statement, selecting the most appropriate model to solve your problem, and then ensuring that you do not overtrain it.
Table of Contents (8 chapters)

Chapter 5: Linear and Logistic Regression Models

Activity 18: Implementing Linear Regression

Solution:

  1. Attach the packages:

    # Attach packages

    library(groupdata2)

    library(cvms)

    library(caret)

    library(knitr)

  2. Set the random seed to 1:

    # Set seed for reproducibility and easy comparison

    set.seed(1)

  3. Load the cars dataset from caret:

    # Load the cars dataset

    data(cars)

  4. Partition the dataset into a training set (80%) and a validation set (20%):

    # Partition the dataset

    partitions <- partition(cars, p = 0.8)

    train_set <- partitions[[1]]

    valid_set <- partitions[[2]]

  5. Fit multiple linear regression models on the training set with the lm() function, predicting Price. Try different predictors. View and interpret the summary() of each fitted model. How do the interpretations change when you add or subtract predictors?

    # Fit a couple of linear models and interpret them

    # Model 1 - Predicting price by mileage

    model_1 <- lm(Price ~ Mileage, data = train_set)

    summary(model_1)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = Price ~ Mileage, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -12977  -7400  -3453   6009  45540

    ##

    ## Coefficients:

    ##               Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)  2.496e+04  1.019e+03  24.488  < 2e-16 ***

    ## Mileage     -1.736e-01  4.765e-02  -3.644 0.000291 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 9762 on 641 degrees of freedom

    ## Multiple R-squared:  0.02029,    Adjusted R-squared:  0.01876

    ## F-statistic: 13.28 on 1 and 641 DF,  p-value: 0.0002906

    # Model 2 - Predicting price by number of doors

    model_2 <- lm(Price ~ Doors, data = train_set)

    summary(model_2)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = Price ~ Doors, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -12540  -7179  -2934   5814  45805

    ##

    ## Coefficients:

    ##             Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)  25682.2     1662.1  15.452   <2e-16 ***

    ## Doors        -1176.6      457.5  -2.572   0.0103 *

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 9812 on 641 degrees of freedom

    ## Multiple R-squared:  0.01021,    Adjusted R-squared:  0.008671

    ## F-statistic: 6.615 on 1 and 641 DF,  p-value: 0.01034

    # Model 3 - Predicting price by mileage and number of doors

    model_3 <- lm(Price ~ Mileage + Doors, data = train_set)

    summary(model_3)

    The summary of the model is as follows:

    ##

    ## Call:

    ## lm(formula = Price ~ Mileage + Doors, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -12642  -7503  -3000   5595  43576

    ##

    ## Coefficients:

    ##               Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)  2.945e+04  1.926e+03  15.292  < 2e-16 ***

    ## Mileage     -1.786e-01  4.744e-02  -3.764 0.000182 ***

    ## Doors       -1.242e+03  4.532e+02  -2.740 0.006308 **

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 9713 on 640 degrees of freedom

    ## Multiple R-squared:  0.03165,    Adjusted R-squared:  0.02863

    ## F-statistic: 10.46 on 2 and 640 DF,  p-value: 3.388e-05

  6. Create model formulas with combine_predictors(). Limit the number of possibilities by a) using only the first four predictors, b) limiting the number of fixed effects in the formulas to three by specifying max_fixed_effects = 3, c) limiting the biggest possible interaction to a two-way interaction by specifying max_interaction_size = 2, and d) limiting the number of times a predictor can be included in a formula by specifying max_effect_frequency = 1. These limitations will decrease the number of models to run, which you may or may not want in your own projects:

    # Create list of model formulas with combine_predictors()

    # Use only the 4 first predictors (to save time)

    # Limit the number of fixed effects (predictors) to 3,

    # Limit the biggest possible interaction to a 2-way interaction

    # Limit the number of times a fixed effect is included to 1

    model_formulas <- combine_predictors(

        dependent = "Price",

        fixed_effects = c("Mileage", "Cylinder",

                          "Doors", "Cruise"),

        max_fixed_effects = 3,

        max_interaction_size = 2,

        max_effect_frequency = 1)

    # Output model formulas

    model_formulas

    The output is as follows:

    ##  [1] "Price ~ Cruise"                    

    ##  [2] "Price ~ Cylinder"                  

    ##  [3] "Price ~ Doors"                     

    ##  [4] "Price ~ Mileage"                   

    ##  [5] "Price ~ Cruise * Cylinder"         

    ##  [6] "Price ~ Cruise * Doors"            

    ##  [7] "Price ~ Cruise * Mileage"          

    ##  [8] "Price ~ Cruise + Cylinder"         

    ##  [9] "Price ~ Cruise + Doors"            

    ## [10] "Price ~ Cruise + Mileage"          

    ## [11] "Price ~ Cylinder * Doors"          

    ## [12] "Price ~ Cylinder * Mileage"        

    ## [13] "Price ~ Cylinder + Doors"          

    ## [14] "Price ~ Cylinder + Mileage"        

    ## [15] "Price ~ Doors * Mileage"           

    ## [16] "Price ~ Doors + Mileage"           

    ## [17] "Price ~ Cruise * Cylinder + Doors"

    ## [18] "Price ~ Cruise * Cylinder + Mileage"

    ## [19] "Price ~ Cruise * Doors + Cylinder"

    ## [20] "Price ~ Cruise * Doors + Mileage"  

    ## [21] "Price ~ Cruise * Mileage + Cylinder"

    ## [22] "Price ~ Cruise * Mileage + Doors"  

    ## [23] "Price ~ Cruise + Cylinder * Doors"

    ## [24] "Price ~ Cruise + Cylinder * Mileage"

    ## [25] "Price ~ Cruise + Cylinder + Doors"

    ## [26] "Price ~ Cruise + Cylinder + Mileage"

    ## [27] "Price ~ Cruise + Doors * Mileage"  

    ## [28] "Price ~ Cruise + Doors + Mileage"  

    ## [29] "Price ~ Cylinder * Doors + Mileage"

    ## [30] "Price ~ Cylinder * Mileage + Doors"

    ## [31] "Price ~ Cylinder + Doors * Mileage"

    ## [32] "Price ~ Cylinder + Doors + Mileage"

  7. Create fivefold columns with four folds each in the training set, using fold() with k = 4 and num_fold_cols = 5. Feel free to choose a higher number of fold columns:

    # Create 5 fold columns with 4 folds each in the training set

    train_set <- fold(train_set, k = 4,

                      num_fold_cols = 5)

  8. Create the fold column names with paste0():

    # Create list of fold column names

    fold_cols <- paste0(".folds_", 1:5)

  9. Perform repeated cross-validation on your model formulas with cvms:

    # Cross-validate the models with cvms

    CV_results <- cross_validate(train_set,

                                 models = model_formulas,

                                 fold_cols = fold_cols,

                                 family = "gaussian")

  10. Print the top 10 performing models according to RMSE. Select the best model:

    # Select the best model by RMSE

    # Order by RMSE

    CV_results <- CV_results[order(CV_results$RMSE),]

    # Select the 10 best performing models for printing

    # (Feel free to view all the models)

    CV_results_top10 <- head(CV_results, 10)

    # Show metrics and model definition columns

    # Use kable for a prettier output

    kable(select_metrics(CV_results_top10), digits = 2)

    The output is as follows:

    Figure 5.29: Top 10 performing models using RMSE
    Figure 5.29: Top 10 performing models using RMSE
  11. Fit the best model on the entire training set and evaluate it on the validation set. This can be done with the validate() function in cvms:

    # Evaluate the best model on the validation set with validate()

    V_results <- validate(

        train_data = train_set,

        test_data = valid_set,

        models = "Price ~ Cruise * Cylinder + Mileage",

        family = "gaussian")

  12. The output contains the results data frame and the trained model. Assign these to variable names:

    valid_results <- V_results$Results

    valid_model <- V_results$Models[[1]]

  13. Print the results:

    # Print the results

    kable(select_metrics(valid_results), digits = 2)

    The output is as follows:

    Figure 5.30: Results of the validated model
    Figure 5.30: Results of the validated model
  14. View and interpret the summary of the best model:

    # Print the model summary and interpret it

    summary(valid_model)

    The summary is as follows:

    ##

    ## Call:

    ## lm(formula = model_formula, data = train_set)

    ##

    ## Residuals:

    ##    Min     1Q Median     3Q    Max

    ## -10485  -5495  -1425   3494  34693

    ##

    ## Coefficients:

    ##                   Estimate Std. Error t value Pr(>|t|)   

    ## (Intercept)      8993.2446  3429.9320   2.622  0.00895 **

    ## Cruise          -1311.6871  3585.6289  -0.366  0.71462   

    ## Cylinder         1809.5447   741.9185   2.439  0.01500 *

    ## Mileage            -0.1569     0.0367  -4.274 2.21e-05 ***

    ## Cruise:Cylinder  1690.0768   778.7838   2.170  0.03036 *

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## Residual standard error: 7503 on 638 degrees of freedom

    ## Multiple R-squared:  0.424,  Adjusted R-squared:  0.4203

    ## F-statistic: 117.4 on 4 and 638 DF,  p-value: < 2.2e-16

Activity 19: Classifying Room Types

Solution:

  1. Attach the groupdata2, cvms, caret, randomForest, rPref, and doParallel packages:

    library(groupdata2)

    library(cvms)

    library(caret)

    library(randomForest)

    library(rPref)

    library(doParallel)

  2. Set the random seed to 3:

    set.seed(3)

  3. Load the amsterdam.listings dataset from https://github.com/TrainingByPackt/Practical-Machine-Learning-with-R/blob/master/Data/amsterdam.listings.csv:

    # Load the amsterdam.listings dataset

    full_data <- read.csv("amsterdam.listings.csv")

  4. Convert the id and neighbourhood columns to factors:

    full_data$id <- factor(full_data$id)

    full_data$neighbourhood <- factor(full_data$neighbourhood)

  5. Summarize the dataset:

    summary(full_data)

    The summary of data is as follows:

    ##        id                        neighbourhood            room_type   

    ##  2818   :    1   De Baarsjes - Oud-West :3052   Entire home/apt:13724

    ##  20168  :    1   De Pijp - Rivierenbuurt:2166   Private room   : 3594

    ##  25428  :    1   Centrum-West           :2019                         

    ##  27886  :    1   Centrum-Oost           :1498                         

    ##  28658  :    1   Westerpark             :1315                         

    ##  28871  :    1   Zuid                   :1187                         

    ##  (Other):17312   (Other)                :6081                         

    ##  availability_365   log_price     log_minimum_nights log_number_of_reviews

    ##  Min.   :  0.00   Min.   :2.079   Min.   :0.0000     Min.   :0.000       

    ##  1st Qu.:  0.00   1st Qu.:4.595   1st Qu.:0.6931     1st Qu.:1.386       

    ##  Median :  0.00   Median :4.852   Median :0.6931     Median :2.398       

    ##  Mean   : 48.33   Mean   :4.883   Mean   :0.8867     Mean   :2.370       

    ##  3rd Qu.: 49.00   3rd Qu.:5.165   3rd Qu.:1.0986     3rd Qu.:3.219       

    ##  Max.   :365.00   Max.   :7.237   Max.   :4.7875     Max.   :6.196       

    ##                                                                          

    ##  log_reviews_per_month

    ##  Min.   :-4.60517    

    ##  1st Qu.:-1.42712    

    ##  Median :-0.61619    

    ##  Mean   :-0.67858    

    ##  3rd Qu.: 0.07696    

    ##  Max.   : 2.48907    

    ##

  6. Partition the dataset into a training set (80%) and a validation set (20%). Balance the partitions by room_type:

    partitions <- partition(full_data, p = 0.8,

                            cat_col = "room_type")

    train_set <- partitions[[1]]

    valid_set <- partitions[[2]]

  7. Prepare for running the baseline evaluations and the cross-validations in parallel by registering the number of cores for doParallel:

    # Register four CPU cores

    registerDoParallel(4)

  8. Create the baseline evaluation for the task on the validation set with the baseline() function from cvms. Run 100 evaluations in parallel. Specify the dependent column as room_type. Note that the default positive class is Private room:

    room_type_baselines <- baseline(test_data = valid_set,

                                    dependent_col = "room_type",

                                    n = 100,

                                    family = "binomial",

                                    parallel = TRUE)

    # Inspect summarized metrics

    room_type_baselines$summarized_metrics

    The output is as follows:

    ## # A tibble: 10 x 15

    ##    Measure 'Balanced Accur…       F1 Sensitivity Specificity

    ##    <chr>              <dbl>    <dbl>       <dbl>       <dbl>

    ##  1 Mean              0.502   0.295        0.503      0.500

    ##  2 Median            0.502   0.294        0.502      0.500

    ##  3 SD                0.0101  0.00954      0.0184     0.00924

    ##  4 IQR               0.0154  0.0131       0.0226     0.0122

    ##  5 Max               0.525   0.317        0.551      0.522

    ##  6 Min               0.480   0.275        0.463      0.480

    ##  7 NAs               0       0            0          0     

    ##  8 INFs              0       0            0          0     

    ##  9 All_0             0.5    NA            0          1     

    ## 10 All_1             0.5     0.344        1          0     

    ## # … with 10 more variables: 'Pos Pred Value' <dbl>, 'Neg Pred

    ## #   Value' <dbl>, AUC <dbl>, 'Lower CI' <dbl>, 'Upper CI' <dbl>,

    ## #   Kappa <dbl>, MCC <dbl>, 'Detection Rate' <dbl>, 'Detection

    ## #   Prevalence' <dbl>, Prevalence <dbl>

  9. Fit multiple logistic regression models on the training set with the glm() function, predicting room_type. Try different predictors. View the summary() of each fitted model and try to interpret the estimated coefficients. Observe how the interpretations change when you add or subtract predictors:

    logit_model_1 <- glm("room_type ~ log_number_of_reviews",

                         data = train_set, family="binomial")

    summary(logit_model_1)

    The summary of the model is as follows:

    ## ## Call:

    ## glm(formula = "room_type ~ log_number_of_reviews", family = "binomial",

    ##     data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -1.3030  -0.7171  -0.5668  -0.3708   2.3288

    ##

    ## Coefficients:

    ##                       Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           -2.64285    0.05374  -49.18   <2e-16 ***

    ## log_number_of_reviews  0.49976    0.01745   28.64   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149  on 13853  degrees of freedom

    ## Residual deviance: 13249  on 13852  degrees of freedom

    ## AIC: 13253

    ##

    ## Number of Fisher Scoring iterations: 4

  10. Add availability_365 as a predictor:

    logit_model_2 <- glm(

        "room_type ~ availability_365 + log_number_of_reviews",

        data = train_set, family = "binomial")

    summary(logit_model_2)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = "room_type ~ availability_365 + log_number_of_reviews",

    ##     family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -1.5277  -0.6735  -0.5535  -0.3688   2.3365

    ##

    ## Coefficients:

    ##                         Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           -2.6622105  0.0533345  -49.91   <2e-16 ***

    ## availability_365       0.0039866  0.0002196   18.16   <2e-16 ***

    ## log_number_of_reviews  0.4172148  0.0178015   23.44   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149  on 13853  degrees of freedom

    ## Residual deviance: 12935  on 13851  degrees of freedom

    ## AIC: 12941

    ##

    ## Number of Fisher Scoring iterations: 4

  11. Add log_price as a predictor:

    logit_model_3 <- glm(

        "room_type ~ availability_365 + log_number_of_reviews + log_price",

        data = train_set, family = "binomial")

    summary(logit_model_3)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = "room_type ~ availability_365 + log_number_of_reviews + log_price",

    ##     family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.7678  -0.5805  -0.3395  -0.1208   3.9864

    ##

    ## Coefficients:

    ##                         Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           12.6730455  0.3419771   37.06   <2e-16 ***

    ## availability_365       0.0081215  0.0002865   28.34   <2e-16 ***

    ## log_number_of_reviews  0.3613055  0.0199845   18.08   <2e-16 ***

    ## log_price             -3.2539506  0.0745417  -43.65   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149.2  on 13853  degrees of freedom

    ## Residual deviance:  9984.7  on 13850  degrees of freedom

    ## AIC: 9992.7

    ##

    ## Number of Fisher Scoring iterations: 6

  12. Add log_minimum_nights as a predictor:

    logit_model_4 <- glm(

        "room_type ~ availability_365 + log_number_of_reviews + log_price + log_minimum_nights",

         data = train_set, family = "binomial")

    summary(logit_model_4)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = "room_type ~ availability_365 + log_number_of_reviews + log_price + log_minimum_nights",

    ##     family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.6268  -0.5520  -0.3142  -0.1055   4.6062

    ##

    ## Coefficients:

    ##                        Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           13.470868   0.354695   37.98   <2e-16 ***

    ## availability_365       0.008310   0.000295   28.17   <2e-16 ***

    ## log_number_of_reviews  0.360133   0.020422   17.64   <2e-16 ***

    ## log_price             -3.252957   0.076343  -42.61   <2e-16 ***

    ## log_minimum_nights    -1.007354   0.051131  -19.70   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149.2  on 13853  degrees of freedom

    ## Residual deviance:  9543.9  on 13849  degrees of freedom

    ## AIC: 9553.9

    ##

    ## Number of Fisher Scoring iterations: 6

  13. Replace log_number_of_reviews with log_reviews_per_month:

    logit_model_5 <- glm(

        "room_type ~ availability_365 + log_reviews_per_month + log_price + log_minimum_nights",

        data = train_set, family = "binomial")

    summary(logit_model_5)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = "room_type ~ availability_365 + log_reviews_per_month + log_price + log_minimum_nights",

    ##     family = "binomial", data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.7351  -0.5229  -0.2934  -0.0968   4.7252

    ##

    ## Coefficients:

    ##                         Estimate Std. Error z value Pr(>|z|)   

    ## (Intercept)           14.7495851  0.3652303   40.38   <2e-16 ***

    ## availability_365       0.0074308  0.0003019   24.61   <2e-16 ***

    ## log_reviews_per_month  0.6364218  0.0246423   25.83   <2e-16 ***

    ## log_price             -3.2850702  0.0781567  -42.03   <2e-16 ***

    ## log_minimum_nights    -0.8504701  0.0526379  -16.16   <2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149  on 13853  degrees of freedom

    ## Residual deviance:  9103  on 13849  degrees of freedom

    ## AIC: 9113

    ##

    ## Number of Fisher Scoring iterations: 6

  14. Create model formulas with combine_predictors(). To save time, limit the interaction size to 2 by specifying max_interaction_size = 2, and limit the number of times an effect can be included in a formula to 1 by specifying max_effect_frequency = 1:

    model_formulas <- combine_predictors(

        dependent = "room_type",

        fixed_effects = c("log_minimum_nights",

                          "log_number_of_reviews",

                          "log_price",

                          "availability_365",

                          "log_reviews_per_month"),

        max_interaction_size = 2,

        max_effect_frequency = 1)

    head(model_formulas, 10)

    The output is as follows:

    ##  [1] "room_type ~ availability_365"                       

    ##  [2] "room_type ~ log_minimum_nights"                     

    ##  [3] "room_type ~ log_number_of_reviews"                  

    ##  [4] "room_type ~ log_price"                              

    ##  [5] "room_type ~ log_reviews_per_month"                  

    ##  [6] "room_type ~ availability_365 * log_minimum_nights"  

    ##  [7] "room_type ~ availability_365 * log_number_of_reviews"

    ##  [8] "room_type ~ availability_365 * log_price"           

    ##  [9] "room_type ~ availability_365 * log_reviews_per_month"

    ## [10] "room_type ~ availability_365 + log_minimum_nights"

  15. Create fivefold columns with five folds each in the training set, using fold() with k = 5 and num_fold_cols = 5. Balance the folds by room_type. Feel free to choose a higher number of fold columns:

    train_set <- fold(train_set, k = 5,

                      num_fold_cols = 5,

                      cat_col = "room_type")

  16. Perform cross-validation (not repeated) on your model formulas with cvms. Specify fold_cols = ".folds_1". Order the results by F1 and show the best 10 models:

    initial_cv_results <- cross_validate(

        train_set,

        models = model_formulas,

        fold_cols = ".folds_1",

        family = "binomial",

        parallel = TRUE)

    initial_cv_results <- initial_cv_results[

        order(initial_cv_results$F1, decreasing = TRUE),]

    head(initial_cv_results, 10)

    The output is as follows:

    ## # A tibble: 10 x 26

    ##    'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##               <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ##  1            0.764 0.662       0.567       0.962            0.797

    ##  2            0.764 0.662       0.565       0.963            0.798

    ##  3            0.763 0.662       0.563       0.963            0.801

    ##  4            0.763 0.661       0.563       0.963            0.800

    ##  5            0.761 0.654       0.564       0.958            0.778

    ##  6            0.757 0.653       0.549       0.966            0.807

    ##  7            0.757 0.652       0.549       0.965            0.804

    ##  8            0.758 0.649       0.560       0.957            0.774

    ##  9            0.756 0.649       0.550       0.962            0.792

    ## 10            0.758 0.649       0.559       0.957            0.775

    ## # … with 21 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, Folds <int>, 'Fold Columns' <int>, 'Convergence

    ## #   Warnings' <dbl>, 'Singular Fit Messages' <int>, Family <chr>,

    ## #   Link <chr>, Dependent <chr>, Fixed <chr>

  17. Perform repeated cross-validation on the 10-20 best model formulas (by F1) with cvms:

    # Reconstruct the best 20 models' formulas

    reconstructed_formulas <- reconstruct_formulas(

        initial_cv_results,

        topn = 20)

    # Create fold_cols

    fold_cols <- paste0(".folds_", 1:5)

    # Perform repeated cross-validation

    repeated_cv_results <- cross_validate(

        train_set,

        models = model_formulas,

        fold_cols = fold_cols,

        family = "binomial",

        parallel = TRUE)

    # Order by F1

    repeated_cv_results <- repeated_cv_results[

        order(repeated_cv_results$F1, decreasing = TRUE),]

    # Inspect the 10 best modelsresults

    head(repeated_cv_results)

    The output is as follows:

    ## # A tibble: 6 x 27

    ##   'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##              <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ## 1            0.764 0.662       0.566       0.962            0.796

    ## 2            0.763 0.661       0.564       0.963            0.798

    ## 3            0.763 0.660       0.562       0.963            0.800

    ## 4            0.762 0.659       0.561       0.963            0.800

    ## 5            0.761 0.654       0.563       0.958            0.780

    ## 6            0.758 0.654       0.551       0.965            0.805

    ## # … with 22 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, Results <list>, Folds <int>, 'Fold

    ## #   Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit

    ## #   Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,

    ## #   Fixed <chr>

  18. Find the Pareto front based on the F1 and balanced accuracy scores. Use psel() from the rPref package and specify pref = high("F1") * high("`Balanced Accuracy`"). Note the ticks around `Balanced Accuracy`:

    # Find the Pareto front

    front <- psel(repeated_cv_results,

                  pref = high("F1") * high("`Balanced Accuracy`"))

    # Remove rows with NA in F1 or Balanced Accuracy

    front <- front[complete.cases(front[1:2]), ]

    # Inspect front

    front

    The output is as follows:

    ## # A tibble: 1 x 27

    ##   'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##              <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ## 1            0.764 0.662       0.566       0.962            0.796

    ## # … with 22 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, Results <list>, Folds <int>, 'Fold

    ## #   Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit

    ## #   Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,

    ## #   Fixed <chr>

    The best model according to F1 is also the best model by balanced accuracy, so the Pareto front only contains one model.

  19. Plot the Pareto front with the ggplot2 code from Exercise 61, Plotting the Pareto Front. Note that you may need to add ticks around 'Balanced Accuracy' when specifying x or y in aes() in the ggplot call:

    # Create ggplot object

    # with precision on the x-axis and precision on the y-axis

    ggplot(repeated_cv_results, aes(x = F1, y = 'Balanced Accuracy')) +

      # Add the models as points

      geom_point(shape = 1, size = 0.5) +

      # Add the nondominated models as larger points

      geom_point(data = front, size = 3) +

      # Add a line to visualize the pareto front

      geom_step(data = front, direction = "vh") +

      # Add the light theme

      theme_light()

    The output is similar to the following:

    Figure 5.31: Pareto front with the F1 and balanced accuracy scores
    Figure 5.31: Pareto front with the F1 and balanced accuracy scores
  20. Use validate() to train the nondominated models on the training set and evaluate them on the validation set:

    # Reconstruct the formulas for the front models

    reconstructed_formulas <- reconstruct_formulas(front)

    # Validate the models in the Pareto front

    v_results_list <- validate(train_data = train_set,

                               test_data = valid_set,

                               models = reconstructed_formulas,

                               family = "binomial")

    # Assign the results and model(s) to variable names

    v_results <- v_results_list$Results

    v_model <- v_results_list$Models[[1]]

    v_results

    The output is as follows:

    ## # A tibble: 1 x 24

    ##   'Balanced Accur…    F1 Sensitivity Specificity 'Pos Pred Value'

    ##              <dbl> <dbl>       <dbl>       <dbl>            <dbl>

    ## 1            0.758 0.652       0.554       0.962            0.794

    ## # … with 19 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower

    ## #   CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection

    ## #   Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,

    ## #   Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,

    ## #   Coefficients <list>, 'Convergence Warnings' <dbl>, 'Singular Fit

    ## #   Messages' <dbl>, Family <chr>, Link <chr>, Dependent <chr>,

    ## #   Fixed <chr>

    These results are a lot better than the baseline on both F1 and balanced accuracy.

  21. View the summaries of the nondominated model(s):

    summary(v_model)

    The summary of the model is as follows:

    ##

    ## Call:

    ## glm(formula = model_formula, family = binomial(link = link),

    ##     data = train_set)

    ##

    ## Deviance Residuals:

    ##     Min       1Q   Median       3Q      Max

    ## -3.8323  -0.4836  -0.2724  -0.0919   3.9091

    ##

    ## Coefficients:

    ##                           Estimate Std. Error z value  Pr(>|z|)

    ## (Intercept)               15.3685268  0.4511978  34.062  < 2e-16 ***

    ## availability_365          -0.0140209  0.0030623  -4.579 4.68e-06 ***

    ## log_price                 -3.4441520  0.0956189 -36.020  < 2e-16 ***

    ## log_minimum_nights        -0.7163252  0.0535452 -13.378  < 2e-16 ***

    ## log_number_of_reviews      -0.0823821  0.0282115  -2.920   0.0035 **

    ## log_reviews_per_month      0.0733808  0.0381629   1.923   0.0545 .

    ## availability_365:log_price 0.0042772  0.0006207   6.891 5.53e-12 ***

    ## log_n_o_reviews:log_r_p_month 0.3730603 0.0158122 23.593 < 2e-16 ***

    ## ---

    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    ##

    ## (Dispersion parameter for binomial family taken to be 1)

    ##

    ##     Null deviance: 14149.2  on 13853  degrees of freedom

    ## Residual deviance:  8476.7  on 13846  degrees of freedom

    ## AIC: 8492.7

    ##

    ## Number of Fisher Scoring iterations: 6

    Note, that we have shortened log_number_of_reviews and log_reviews_per_month in the interaction term, log_n_o_reviews:log_r_p_month. The coefficients for the two interaction terms are both statistically significant. The interaction term log_n_o_reviews:log_r_p_month tells us that when log_number_of_reviews increases by one unit, the coefficient for log_reviews_per_month increases by 0.37, and vice versa. We might question the meaningfulness of including both these predictors in the model, as they have some of the same information. If we also had the number of months the listing had been listed, we should be able to recreate log_number_of_reviews from log_reviews_per_month and the number of months, which would probably be easier to interpret as well.

    The second interaction term, availability_365:log_price, tells us that when availability_365 increases by a single unit, the coefficient for log_price increases by 0.004, and vice versa. The coefficient estimate for log_price is -3.44, meaning that when availability_365 is low, a higher log_price decreases the probability that the listing is a private room. This fits with the intuition that a private room is usually cheaper than an entire home/apartment.

    The coefficient for log_minimum_nights tells us that when there is a higher minimum requirement for the number of nights when we book the listing, there's a lower probability that the listing is a private room.