Practical Machine Learning with R
Solution:
# Attach packages
library(groupdata2)
library(cvms)
library(caret)
library(knitr)
# Set seed for reproducibility and easy comparison
set.seed(1)
# Load the cars dataset
data(cars)
# Partition the dataset
partitions <- partition(cars, p = 0.8)
train_set <- partitions[[1]]
valid_set <- partitions[[2]]
# Fit a few linear models and interpret them
# Model 1 - Predicting price by mileage
model_1 <- lm(Price ~ Mileage, data = train_set)
summary(model_1)
The summary of the model is as follows:
##
## Call:
## lm(formula = Price ~ Mileage, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12977 -7400 -3453 6009 45540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.496e+04 1.019e+03 24.488 < 2e-16 ***
## Mileage -1.736e-01 4.765e-02 -3.644 0.000291 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9762 on 641 degrees of freedom
## Multiple R-squared: 0.02029, Adjusted R-squared: 0.01876
## F-statistic: 13.28 on 1 and 641 DF, p-value: 0.0002906
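To make the Mileage coefficient concrete, we can plug the printed estimates into the regression equation by hand (the 20,000-mile figure below is just an illustrative value):

```r
# Predicted price from model_1's printed coefficients
intercept <- 2.496e+04   # (Intercept) estimate
slope     <- -1.736e-01  # Mileage estimate

# A car with 20,000 miles on the odometer (illustrative value)
predicted_price <- intercept + slope * 20000
predicted_price
## [1] 21488
```

Each additional mile thus lowers the expected price by roughly 17 cents.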
# Model 2 - Predicting price by number of doors
model_2 <- lm(Price ~ Doors, data = train_set)
summary(model_2)
The summary of the model is as follows:
##
## Call:
## lm(formula = Price ~ Doors, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12540 -7179 -2934 5814 45805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25682.2 1662.1 15.452 <2e-16 ***
## Doors -1176.6 457.5 -2.572 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9812 on 641 degrees of freedom
## Multiple R-squared: 0.01021, Adjusted R-squared: 0.008671
## F-statistic: 6.615 on 1 and 641 DF, p-value: 0.01034
# Model 3 - Predicting price by mileage and number of doors
model_3 <- lm(Price ~ Mileage + Doors, data = train_set)
summary(model_3)
The summary of the model is as follows:
##
## Call:
## lm(formula = Price ~ Mileage + Doors, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12642 -7503 -3000 5595 43576
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.945e+04 1.926e+03 15.292 < 2e-16 ***
## Mileage -1.786e-01 4.744e-02 -3.764 0.000182 ***
## Doors -1.242e+03 4.532e+02 -2.740 0.006308 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9713 on 640 degrees of freedom
## Multiple R-squared: 0.03165, Adjusted R-squared: 0.02863
## F-statistic: 10.46 on 2 and 640 DF, p-value: 3.388e-05
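Before moving on, we can compare the three models using the adjusted R-squared values printed in their summaries (typed in by hand here):

```r
# Adjusted R-squared values from the three model summaries above
adj_r2 <- c(model_1 = 0.01876, model_2 = 0.008671, model_3 = 0.02863)
names(which.max(adj_r2))
## [1] "model_3"
```

Model 3 explains the most variance, though all three values are low; the cross-validation below gives a fairer comparison.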
# Create list of model formulas with combine_predictors()
# Use only the first 4 predictors (to save time)
# Limit the number of fixed effects (predictors) to 3,
# Limit the biggest possible interaction to a 2-way interaction
# Limit the number of times a fixed effect is included to 1
model_formulas <- combine_predictors(
dependent = "Price",
fixed_effects = c("Mileage", "Cylinder",
"Doors", "Cruise"),
max_fixed_effects = 3,
max_interaction_size = 2,
max_effect_frequency = 1)
# Output model formulas
model_formulas
The output is as follows:
## [1] "Price ~ Cruise"
## [2] "Price ~ Cylinder"
## [3] "Price ~ Doors"
## [4] "Price ~ Mileage"
## [5] "Price ~ Cruise * Cylinder"
## [6] "Price ~ Cruise * Doors"
## [7] "Price ~ Cruise * Mileage"
## [8] "Price ~ Cruise + Cylinder"
## [9] "Price ~ Cruise + Doors"
## [10] "Price ~ Cruise + Mileage"
## [11] "Price ~ Cylinder * Doors"
## [12] "Price ~ Cylinder * Mileage"
## [13] "Price ~ Cylinder + Doors"
## [14] "Price ~ Cylinder + Mileage"
## [15] "Price ~ Doors * Mileage"
## [16] "Price ~ Doors + Mileage"
## [17] "Price ~ Cruise * Cylinder + Doors"
## [18] "Price ~ Cruise * Cylinder + Mileage"
## [19] "Price ~ Cruise * Doors + Cylinder"
## [20] "Price ~ Cruise * Doors + Mileage"
## [21] "Price ~ Cruise * Mileage + Cylinder"
## [22] "Price ~ Cruise * Mileage + Doors"
## [23] "Price ~ Cruise + Cylinder * Doors"
## [24] "Price ~ Cruise + Cylinder * Mileage"
## [25] "Price ~ Cruise + Cylinder + Doors"
## [26] "Price ~ Cruise + Cylinder + Mileage"
## [27] "Price ~ Cruise + Doors * Mileage"
## [28] "Price ~ Cruise + Doors + Mileage"
## [29] "Price ~ Cylinder * Doors + Mileage"
## [30] "Price ~ Cylinder * Mileage + Doors"
## [31] "Price ~ Cylinder + Doors * Mileage"
## [32] "Price ~ Cylinder + Doors + Mileage"
# Create 5 fold columns with 4 folds each in the training set
train_set <- fold(train_set, k = 4,
num_fold_cols = 5)
# Create list of fold column names
fold_cols <- paste0(".folds_", 1:5)
# Cross-validate the models with cvms
CV_results <- cross_validate(train_set,
models = model_formulas,
fold_cols = fold_cols,
family = "gaussian")
# Select the best model by RMSE
# Order by RMSE
CV_results <- CV_results[order(CV_results$RMSE),]
# Select the 10 best performing models for printing
# (Feel free to view all the models)
CV_results_top10 <- head(CV_results, 10)
# Show metrics and model definition columns
# Use kable for a prettier output
kable(select_metrics(CV_results_top10), digits = 2)
The output is a table of the evaluation metrics and model formulas for the 10 best models (table not reproduced here).
# Evaluate the best model on the validation set with validate()
V_results <- validate(
train_data = train_set,
test_data = valid_set,
models = "Price ~ Cruise * Cylinder + Mileage",
family = "gaussian")
valid_results <- V_results$Results
valid_model <- V_results$Models[[1]]
# Print the results
kable(select_metrics(valid_results), digits = 2)
The output is a table of the validation metrics for the model (table not reproduced here).
# Print the model summary and interpret it
summary(valid_model)
The summary is as follows:
##
## Call:
## lm(formula = model_formula, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10485 -5495 -1425 3494 34693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8993.2446 3429.9320 2.622 0.00895 **
## Cruise -1311.6871 3585.6289 -0.366 0.71462
## Cylinder 1809.5447 741.9185 2.439 0.01500 *
## Mileage -0.1569 0.0367 -4.274 2.21e-05 ***
## Cruise:Cylinder 1690.0768 778.7838 2.170 0.03036 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7503 on 638 degrees of freedom
## Multiple R-squared: 0.424, Adjusted R-squared: 0.4203
## F-statistic: 117.4 on 4 and 638 DF, p-value: < 2.2e-16
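The Cruise:Cylinder interaction can be unpacked by computing the slope of Cylinder at each level of the binary Cruise predictor, using the printed estimates:

```r
b_cylinder    <- 1809.5447  # Cylinder estimate
b_interaction <- 1690.0768  # Cruise:Cylinder estimate

slope_no_cruise   <- b_cylinder                  # when Cruise == 0
slope_with_cruise <- b_cylinder + b_interaction  # when Cruise == 1
c(slope_no_cruise, slope_with_cruise)
```

So an extra cylinder is associated with a price increase of about $1,810 for cars without cruise control, but about $3,500 for cars with it.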
Solution:
library(groupdata2)
library(cvms)
library(caret)
library(randomForest)
library(rPref)
library(doParallel)
library(ggplot2)
set.seed(3)
# Load the amsterdam.listings dataset
full_data <- read.csv("amsterdam.listings.csv")
full_data$id <- factor(full_data$id)
full_data$neighbourhood <- factor(full_data$neighbourhood)
summary(full_data)
The summary of data is as follows:
## id neighbourhood room_type
## 2818 : 1 De Baarsjes - Oud-West :3052 Entire home/apt:13724
## 20168 : 1 De Pijp - Rivierenbuurt:2166 Private room : 3594
## 25428 : 1 Centrum-West :2019
## 27886 : 1 Centrum-Oost :1498
## 28658 : 1 Westerpark :1315
## 28871 : 1 Zuid :1187
## (Other):17312 (Other) :6081
## availability_365 log_price log_minimum_nights log_number_of_reviews
## Min. : 0.00 Min. :2.079 Min. :0.0000 Min. :0.000
## 1st Qu.: 0.00 1st Qu.:4.595 1st Qu.:0.6931 1st Qu.:1.386
## Median : 0.00 Median :4.852 Median :0.6931 Median :2.398
## Mean : 48.33 Mean :4.883 Mean :0.8867 Mean :2.370
## 3rd Qu.: 49.00 3rd Qu.:5.165 3rd Qu.:1.0986 3rd Qu.:3.219
## Max. :365.00 Max. :7.237 Max. :4.7875 Max. :6.196
##
## log_reviews_per_month
## Min. :-4.60517
## 1st Qu.:-1.42712
## Median :-0.61619
## Mean :-0.67858
## 3rd Qu.: 0.07696
## Max. : 2.48907
##
partitions <- partition(full_data, p = 0.8,
cat_col = "room_type")
train_set <- partitions[[1]]
valid_set <- partitions[[2]]
# Register four CPU cores
registerDoParallel(4)
room_type_baselines <- baseline(test_data = valid_set,
dependent_col = "room_type",
n = 100,
family = "binomial",
parallel = TRUE)
# Inspect summarized metrics
room_type_baselines$summarized_metrics
The output is as follows:
## # A tibble: 10 x 15
## Measure 'Balanced Accur… F1 Sensitivity Specificity
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Mean 0.502 0.295 0.503 0.500
## 2 Median 0.502 0.294 0.502 0.500
## 3 SD 0.0101 0.00954 0.0184 0.00924
## 4 IQR 0.0154 0.0131 0.0226 0.0122
## 5 Max 0.525 0.317 0.551 0.522
## 6 Min 0.480 0.275 0.463 0.480
## 7 NAs 0 0 0 0
## 8 INFs 0 0 0 0
## 9 All_0 0.5 NA 0 1
## 10 All_1 0.5 0.344 1 0
## # … with 10 more variables: 'Pos Pred Value' <dbl>, 'Neg Pred
## # Value' <dbl>, AUC <dbl>, 'Lower CI' <dbl>, 'Upper CI' <dbl>,
## # Kappa <dbl>, MCC <dbl>, 'Detection Rate' <dbl>, 'Detection
## # Prevalence' <dbl>, Prevalence <dbl>
logit_model_1 <- glm("room_type ~ log_number_of_reviews",
data = train_set, family="binomial")
summary(logit_model_1)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ log_number_of_reviews", family = "binomial",
## data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3030 -0.7171 -0.5668 -0.3708 2.3288
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64285 0.05374 -49.18 <2e-16 ***
## log_number_of_reviews 0.49976 0.01745 28.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14149 on 13853 degrees of freedom
## Residual deviance: 13249 on 13852 degrees of freedom
## AIC: 13253
##
## Number of Fisher Scoring iterations: 4
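On the log-odds scale, the coefficient is hard to read directly; exponentiating the printed estimate gives an odds ratio:

```r
# Convert the log-odds coefficient for log_number_of_reviews
# into an odds ratio
b_reviews <- 0.49976  # estimate from summary(logit_model_1)
odds_ratio <- exp(b_reviews)
round(odds_ratio, 2)
## [1] 1.65
```

That is, each one-unit increase in log_number_of_reviews multiplies the odds of the listing being a private room by about 1.65.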
logit_model_2 <- glm(
"room_type ~ availability_365 + log_number_of_reviews",
data = train_set, family = "binomial")
summary(logit_model_2)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ availability_365 + log_number_of_reviews",
## family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5277 -0.6735 -0.5535 -0.3688 2.3365
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6622105 0.0533345 -49.91 <2e-16 ***
## availability_365 0.0039866 0.0002196 18.16 <2e-16 ***
## log_number_of_reviews 0.4172148 0.0178015 23.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14149 on 13853 degrees of freedom
## Residual deviance: 12935 on 13851 degrees of freedom
## AIC: 12941
##
## Number of Fisher Scoring iterations: 4
logit_model_3 <- glm(
"room_type ~ availability_365 + log_number_of_reviews + log_price",
data = train_set, family = "binomial")
summary(logit_model_3)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ availability_365 + log_number_of_reviews + log_price",
## family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7678 -0.5805 -0.3395 -0.1208 3.9864
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.6730455 0.3419771 37.06 <2e-16 ***
## availability_365 0.0081215 0.0002865 28.34 <2e-16 ***
## log_number_of_reviews 0.3613055 0.0199845 18.08 <2e-16 ***
## log_price -3.2539506 0.0745417 -43.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14149.2 on 13853 degrees of freedom
## Residual deviance: 9984.7 on 13850 degrees of freedom
## AIC: 9992.7
##
## Number of Fisher Scoring iterations: 6
logit_model_4 <- glm(
"room_type ~ availability_365 + log_number_of_reviews + log_price + log_minimum_nights",
data = train_set, family = "binomial")
summary(logit_model_4)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ availability_365 + log_number_of_reviews + log_price + log_minimum_nights",
## family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6268 -0.5520 -0.3142 -0.1055 4.6062
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.470868 0.354695 37.98 <2e-16 ***
## availability_365 0.008310 0.000295 28.17 <2e-16 ***
## log_number_of_reviews 0.360133 0.020422 17.64 <2e-16 ***
## log_price -3.252957 0.076343 -42.61 <2e-16 ***
## log_minimum_nights -1.007354 0.051131 -19.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14149.2 on 13853 degrees of freedom
## Residual deviance: 9543.9 on 13849 degrees of freedom
## AIC: 9553.9
##
## Number of Fisher Scoring iterations: 6
logit_model_5 <- glm(
"room_type ~ availability_365 + log_reviews_per_month + log_price + log_minimum_nights",
data = train_set, family = "binomial")
summary(logit_model_5)
The summary of the model is as follows:
##
## Call:
## glm(formula = "room_type ~ availability_365 + log_reviews_per_month + log_price + log_minimum_nights",
## family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7351 -0.5229 -0.2934 -0.0968 4.7252
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 14.7495851 0.3652303 40.38 <2e-16 ***
## availability_365 0.0074308 0.0003019 24.61 <2e-16 ***
## log_reviews_per_month 0.6364218 0.0246423 25.83 <2e-16 ***
## log_price -3.2850702 0.0781567 -42.03 <2e-16 ***
## log_minimum_nights -0.8504701 0.0526379 -16.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14149 on 13853 degrees of freedom
## Residual deviance: 9103 on 13849 degrees of freedom
## AIC: 9113
##
## Number of Fisher Scoring iterations: 6
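The five logistic regression models can be ranked by the AIC values printed in their summaries (lower is better; the values below are typed in from the output above):

```r
# AIC values from the five model summaries
aics <- c(model_1 = 13253, model_2 = 12941, model_3 = 9992.7,
          model_4 = 9553.9, model_5 = 9113)
names(which.min(aics))
## [1] "model_5"
```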
model_formulas <- combine_predictors(
dependent = "room_type",
fixed_effects = c("log_minimum_nights",
"log_number_of_reviews",
"log_price",
"availability_365",
"log_reviews_per_month"),
max_interaction_size = 2,
max_effect_frequency = 1)
head(model_formulas, 10)
The output is as follows:
## [1] "room_type ~ availability_365"
## [2] "room_type ~ log_minimum_nights"
## [3] "room_type ~ log_number_of_reviews"
## [4] "room_type ~ log_price"
## [5] "room_type ~ log_reviews_per_month"
## [6] "room_type ~ availability_365 * log_minimum_nights"
## [7] "room_type ~ availability_365 * log_number_of_reviews"
## [8] "room_type ~ availability_365 * log_price"
## [9] "room_type ~ availability_365 * log_reviews_per_month"
## [10] "room_type ~ availability_365 + log_minimum_nights"
train_set <- fold(train_set, k = 5,
num_fold_cols = 5,
cat_col = "room_type")
initial_cv_results <- cross_validate(
train_set,
models = model_formulas,
fold_cols = ".folds_1",
family = "binomial",
parallel = TRUE)
initial_cv_results <- initial_cv_results[
order(initial_cv_results$F1, decreasing = TRUE),]
head(initial_cv_results, 10)
The output is as follows:
## # A tibble: 10 x 26
## 'Balanced Accur… F1 Sensitivity Specificity 'Pos Pred Value'
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.764 0.662 0.567 0.962 0.797
## 2 0.764 0.662 0.565 0.963 0.798
## 3 0.763 0.662 0.563 0.963 0.801
## 4 0.763 0.661 0.563 0.963 0.800
## 5 0.761 0.654 0.564 0.958 0.778
## 6 0.757 0.653 0.549 0.966 0.807
## 7 0.757 0.652 0.549 0.965 0.804
## 8 0.758 0.649 0.560 0.957 0.774
## 9 0.756 0.649 0.550 0.962 0.792
## 10 0.758 0.649 0.559 0.957 0.775
## # … with 21 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower
## # CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection
## # Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,
## # Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,
## # Coefficients <list>, Folds <int>, 'Fold Columns' <int>, 'Convergence
## # Warnings' <dbl>, 'Singular Fit Messages' <int>, Family <chr>,
## # Link <chr>, Dependent <chr>, Fixed <chr>
# Reconstruct the best 20 models' formulas
reconstructed_formulas <- reconstruct_formulas(
initial_cv_results,
topn = 20)
# Create fold_cols
fold_cols <- paste0(".folds_", 1:5)
# Perform repeated cross-validation
repeated_cv_results <- cross_validate(
train_set,
models = reconstructed_formulas,
fold_cols = fold_cols,
family = "binomial",
parallel = TRUE)
# Order by F1
repeated_cv_results <- repeated_cv_results[
order(repeated_cv_results$F1, decreasing = TRUE),]
# Inspect the best models' results
head(repeated_cv_results)
The output is as follows:
## # A tibble: 6 x 27
## 'Balanced Accur… F1 Sensitivity Specificity 'Pos Pred Value'
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.764 0.662 0.566 0.962 0.796
## 2 0.763 0.661 0.564 0.963 0.798
## 3 0.763 0.660 0.562 0.963 0.800
## 4 0.762 0.659 0.561 0.963 0.800
## 5 0.761 0.654 0.563 0.958 0.780
## 6 0.758 0.654 0.551 0.965 0.805
## # … with 22 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower
## # CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection
## # Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,
## # Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,
## # Coefficients <list>, Results <list>, Folds <int>, 'Fold
## # Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit
## # Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,
## # Fixed <chr>
# Find the Pareto front
front <- psel(repeated_cv_results,
pref = high(F1) * high(`Balanced Accuracy`))
# Remove rows with NA in F1 or Balanced Accuracy
front <- front[complete.cases(front[1:2]), ]
# Inspect front
front
The output is as follows:
## # A tibble: 1 x 27
## 'Balanced Accur… F1 Sensitivity Specificity 'Pos Pred Value'
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.764 0.662 0.566 0.962 0.796
## # … with 22 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower
## # CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection
## # Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,
## # Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,
## # Coefficients <list>, Results <list>, Folds <int>, 'Fold
## # Columns' <int>, 'Convergence Warnings' <dbl>, 'Singular Fit
## # Messages' <int>, Family <chr>, Link <chr>, Dependent <chr>,
## # Fixed <chr>
The best model according to F1 is also the best model by balanced accuracy, so the Pareto front only contains one model.
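To see what psel() computes, here is a minimal base-R sketch of (strict) Pareto dominance on a toy data frame with made-up metric values:

```r
# Toy metrics for three hypothetical models
toy <- data.frame(F1 = c(0.66, 0.65, 0.64),
                  BA = c(0.764, 0.766, 0.760))
# A row is dominated if some other row beats it on both metrics
dominated <- sapply(seq_len(nrow(toy)), function(i) {
  any(toy$F1 > toy$F1[i] & toy$BA > toy$BA[i])
})
toy[!dominated, ]  # rows 1 and 2 form the Pareto front
```

Neither of the first two rows beats the other on both metrics, so both stay on the front; the third row is beaten by the first on both metrics and is dropped.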
# Create ggplot object
# with F1 on the x-axis and balanced accuracy on the y-axis
ggplot(repeated_cv_results, aes(x = F1, y = `Balanced Accuracy`)) +
# Add the models as points
geom_point(shape = 1, size = 0.5) +
# Add the nondominated models as larger points
geom_point(data = front, size = 3) +
# Add a line to visualize the Pareto front
geom_step(data = front, direction = "vh") +
# Add the light theme
theme_light()
The resulting plot shows the cross-validated models as points, with the Pareto front model highlighted by a larger point and a step line (plot not reproduced here).
# Reconstruct the formulas for the front models
reconstructed_formulas <- reconstruct_formulas(front)
# Validate the models in the Pareto front
v_results_list <- validate(train_data = train_set,
test_data = valid_set,
models = reconstructed_formulas,
family = "binomial")
# Assign the results and model(s) to variable names
v_results <- v_results_list$Results
v_model <- v_results_list$Models[[1]]
v_results
The output is as follows:
## # A tibble: 1 x 24
## 'Balanced Accur… F1 Sensitivity Specificity 'Pos Pred Value'
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.758 0.652 0.554 0.962 0.794
## # … with 19 more variables: 'Neg Pred Value' <dbl>, AUC <dbl>, 'Lower
## # CI' <dbl>, 'Upper CI' <dbl>, Kappa <dbl>, MCC <dbl>, 'Detection
## # Rate' <dbl>, 'Detection Prevalence' <dbl>, Prevalence <dbl>,
## # Predictions <list>, ROC <list>, 'Confusion Matrix' <list>,
## # Coefficients <list>, 'Convergence Warnings' <dbl>, 'Singular Fit
## # Messages' <dbl>, Family <chr>, Link <chr>, Dependent <chr>,
## # Fixed <chr>
These results are a lot better than the baseline on both F1 and balanced accuracy.
summary(v_model)
The summary of the model is as follows:
##
## Call:
## glm(formula = model_formula, family = binomial(link = link),
## data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8323 -0.4836 -0.2724 -0.0919 3.9091
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.3685268 0.4511978 34.062 < 2e-16 ***
## availability_365 -0.0140209 0.0030623 -4.579 4.68e-06 ***
## log_price -3.4441520 0.0956189 -36.020 < 2e-16 ***
## log_minimum_nights -0.7163252 0.0535452 -13.378 < 2e-16 ***
## log_number_of_reviews -0.0823821 0.0282115 -2.920 0.0035 **
## log_reviews_per_month 0.0733808 0.0381629 1.923 0.0545 .
## availability_365:log_price 0.0042772 0.0006207 6.891 5.53e-12 ***
## log_n_o_reviews:log_r_p_month 0.3730603 0.0158122 23.593 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14149.2 on 13853 degrees of freedom
## Residual deviance: 8476.7 on 13846 degrees of freedom
## AIC: 8492.7
##
## Number of Fisher Scoring iterations: 6
Note that we have abbreviated log_number_of_reviews and log_reviews_per_month in the interaction term log_n_o_reviews:log_r_p_month. The coefficients for both interaction terms are statistically significant. The interaction term log_n_o_reviews:log_r_p_month tells us that when log_number_of_reviews increases by one unit, the coefficient for log_reviews_per_month increases by 0.37, and vice versa. We might question the meaningfulness of including both predictors in the model, as they carry much of the same information. If we also had the number of months each listing had been listed, we could recreate log_number_of_reviews from log_reviews_per_month and that count, which would probably be easier to interpret as well.
The second interaction term, availability_365:log_price, tells us that when availability_365 increases by a single unit, the coefficient for log_price increases by 0.004, and vice versa. The coefficient estimate for log_price is -3.44, meaning that when availability_365 is low, a higher log_price decreases the probability that the listing is a private room. This fits with the intuition that a private room is usually cheaper than an entire home/apartment.
The coefficient for log_minimum_nights tells us that when there is a higher minimum requirement for the number of nights when we book the listing, there's a lower probability that the listing is a private room.
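As a sanity check on these interpretations, we can compute the predicted probability of a private room for a hypothetical listing by plugging the printed coefficients into the inverse-logit by hand (all predictor values below are made up for illustration):

```r
# Printed coefficients of the validated model
b <- c(intercept = 15.3685268, avail = -0.0140209, price = -3.4441520,
       min_nights = -0.7163252, n_reviews = -0.0823821,
       rev_per_month = 0.0733808, avail_price = 0.0042772,
       reviews_int = 0.3730603)
# Hypothetical listing: available 100 days, log price 4.6,
# log minimum nights 0.693 (2 nights), log number of reviews 2.3,
# log reviews per month -0.6
x <- c(avail = 100, price = 4.6, min_nights = 0.693,
       n_reviews = 2.3, rev_per_month = -0.6)
log_odds <- b["intercept"] + b["avail"] * x["avail"] +
  b["price"] * x["price"] + b["min_nights"] * x["min_nights"] +
  b["n_reviews"] * x["n_reviews"] +
  b["rev_per_month"] * x["rev_per_month"] +
  b["avail_price"] * x["avail"] * x["price"] +
  b["reviews_int"] * x["n_reviews"] * x["rev_per_month"]
plogis(unname(log_odds))  # probability of "Private room", about 0.24
```

The model judges this listing more likely to be an entire home/apartment than a private room.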