Practical Machine Learning with R
Solution:
#Time series features
#Install caret if not installed
#install.packages('caret')
library(caret)
GermanCredit = read.csv("GermanCredit.csv")
duration<- GermanCredit$Duration #take the duration column
summary(duration)
The output is as follows:

library(ggplot2)
ggplot(data=GermanCredit, aes(x=Duration)) +
geom_density(fill='lightblue') +
geom_rug() +
labs(x='Duration')
The output is as follows:

#Creating Bins
# set up boundaries for intervals/bins
breaks <- c(0,10,20,30,40,50,60,70,80)
# specify interval/bin labels
labels <- c("<10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80")
# bucketing data points into bins
bins <- cut(duration, breaks, include.lowest = TRUE, right = FALSE, labels = labels)
# inspect bins
summary(bins)
The output is as follows:
  <10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
  143   403   241   131    66     2    13     1
#Plotting the bins
plot(bins, main="Frequency of Duration", ylab="Duration Count", xlab="Duration Bins",col="bisque")
The output is as follows:
We can conclude that the largest group of customers (403 of the 1,000 records) falls in the 10-20 duration bin.
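The effect of right=FALSE and include.lowest on values that fall exactly on a bin boundary can be checked with a few hand-picked durations. This is a minimal sketch; the vector sample_durations is made up for illustration and is not taken from the dataset.

```r
# Same breaks and labels as in the solution
breaks <- c(0,10,20,30,40,50,60,70,80)
labels <- c("<10", "10-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70-80")
# Hypothetical durations chosen to sit on or next to bin boundaries
sample_durations <- c(0, 9, 10, 19, 20, 80)
# right=FALSE makes each bin closed on the left: [0,10), [10,20), ...
# include.lowest=TRUE additionally closes the last bin on the right, so 80 is kept
sample_bins <- cut(sample_durations, breaks, include.lowest = TRUE, right = FALSE, labels = labels)
# Pair each value with the bin it landed in
print(data.frame(duration = sample_durations, bin = sample_bins))
```

A duration of exactly 10 lands in the 10-20 bin rather than in <10, which is worth keeping in mind when reading the bin labels.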
Solution:
#Skewness
library(mlbench)
library(e1071)
#lattice provides histogram(); load it explicitly in case caret is not attached
library(lattice)
PimaIndiansDiabetes = read.csv("PimaIndiansDiabetes.csv")
#Printing the skewness of the columns
#Not skewed
skewness(PimaIndiansDiabetes$glucose)
The output is as follows:
[1] 0.1730754
histogram(PimaIndiansDiabetes$glucose)
The output is as follows:

A negative skewness value means that the data is skewed to the left, and a positive skewness value means that the data is skewed to the right. Since the value here is 0.17, which is close to zero, the distribution is approximately symmetric, so we treat it as not skewed.
#Highly skewed
skewness(PimaIndiansDiabetes$age)
The output is as follows:
[1] 1.125188
histogram(PimaIndiansDiabetes$age)
The output is as follows:
The positive skewness value of about 1.13 means that age is skewed to the right, as the histogram shows.
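To make the sign convention concrete, skewness can be sketched in base R as the third central moment divided by the 1.5th power of the second central moment. The function name moment_skewness and the three small vectors are hypothetical, chosen only for illustration. The skewness() function in e1071 offers several estimator types, so its result on the same data may differ slightly from this sketch.

```r
# Moment-based skewness: third central moment divided by the
# 1.5th power of the second central moment
moment_skewness <- function(x) {
  x <- x[!is.na(x)]              # drop missing values
  deviations <- x - mean(x)      # centre the data
  m2 <- mean(deviations^2)       # second central moment
  m3 <- mean(deviations^3)       # third central moment
  m3 / m2^1.5
}

# Hypothetical toy vectors, not taken from PimaIndiansDiabetes
symmetric_values <- c(1, 2, 3, 4, 5)
right_tailed_values <- c(1, 1, 1, 2, 10)
left_tailed_values <- -right_tailed_values

print(moment_skewness(symmetric_values))     # zero for symmetric data
print(moment_skewness(right_tailed_values))  # positive: long tail on the right
print(moment_skewness(left_tailed_values))   # negative: long tail on the left
```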
Solution:
#PCA
#caret provides the GermanCredit dataset
library(caret)
#Use the German Credit Data
data(GermanCredit)
#Keep the first nine columns
GermanCredit_subset <- GermanCredit[,1:9]
#Find out the Principal components
principal_components <- prcomp(x = GermanCredit_subset, scale. = TRUE)
#Print the principal components
print(principal_components)
The output is as follows:
Standard deviations (1, .., p=9):
[1] 1.3505916 1.2008442 1.1084157 0.9721503 0.9459586
0.9317018 0.9106746 0.8345178 0.5211137
Rotation (n x k) = (9 x 9):
Therefore, principal component analysis gives us nine principal components for the nine input columns, ordered by decreasing standard deviation. Each component is a linear combination of the original fields and can be used as a feature on its own.
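One way to decide how many components to keep is to turn the printed standard deviations into proportions of explained variance. The sketch below copies the nine standard deviations from the output above; the 80% cut-off is an assumption made for illustration and is not part of the original activity.

```r
# Standard deviations copied from the prcomp() output above
sdev <- c(1.3505916, 1.2008442, 1.1084157, 0.9721503, 0.9459586,
          0.9317018, 0.9106746, 0.8345178, 0.5211137)
# The variance of each component is its squared standard deviation
component_variance <- sdev^2
# Share of the total variance carried by each component
proportion_explained <- component_variance / sum(component_variance)
# Running total of explained variance
cumulative_explained <- cumsum(proportion_explained)
print(round(cumulative_explained, 3))
# Hypothetical cut-off: keep the smallest number of components reaching 80%
variance_target <- 0.80
n_components <- which(cumulative_explained >= variance_target)[1]
print(n_components)
```

Because the columns were standardized with scale. = TRUE, the component variances add up to the number of columns, which is a quick sanity check on the output. With a fitted prcomp object, the same figures are available from summary(principal_components).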
Solution:
#caret provides the GermanCredit dataset and varImp()
library(caret)
library(randomForest)
data(GermanCredit)
GermanCredit_subset <- GermanCredit[,1:10]
random_forest <- randomForest(Class~., data=GermanCredit_subset)
# Show variable importance based on the mean decrease in Gini
importance(random_forest)
The output is as follows:
                          MeanDecreaseGini
Duration                         70.380265
Amount                          121.458790
InstallmentRatePercentage        27.048517
ResidenceDuration                30.409254
Age                              86.476017
NumberExistingCredits            18.746057
NumberPeopleMaintenance          12.026969
Telephone                        15.581802
ForeignWorker                     2.888387
varImp(random_forest)
The output is as follows:
                             Overall
Duration                   70.380265
Amount                    121.458790
InstallmentRatePercentage  27.048517
ResidenceDuration          30.409254
Age                        86.476017
NumberExistingCredits      18.746057
NumberPeopleMaintenance    12.026969
Telephone                  15.581802
ForeignWorker               2.888387
In this activity, we built a random forest model and used it to see the importance of each variable in a dataset. The variables with higher scores are considered more important; here, Amount, Age, and Duration rank highest. Having done this, we can sort by importance and choose the top 5 or top 10 for the model, or set a threshold for importance and choose all the variables that meet the threshold.
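Both selection strategies can be sketched in base R. The scores below are copied from the importance output above; the values of top_k and importance_threshold are hypothetical and would need to be tuned for a real model.

```r
# Importance scores copied from the output above
importance_scores <- c(
  Duration = 70.380265, Amount = 121.458790,
  InstallmentRatePercentage = 27.048517, ResidenceDuration = 30.409254,
  Age = 86.476017, NumberExistingCredits = 18.746057,
  NumberPeopleMaintenance = 12.026969, Telephone = 15.581802,
  ForeignWorker = 2.888387
)
# Sort from most to least important
sorted_scores <- sort(importance_scores, decreasing = TRUE)

# Strategy 1: keep the top k variables (k is a hypothetical choice)
top_k <- 3
top_variables <- names(head(sorted_scores, top_k))
print(top_variables)

# Strategy 2: keep every variable at or above a threshold (hypothetical value)
importance_threshold <- 30
selected_variables <- names(sorted_scores[sorted_scores >= importance_threshold])
print(selected_variables)
```

The selected names can then be used to subset the data, for example GermanCredit_subset[, c(selected_variables, "Class")].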
Solution:
#Install rpart if not installed
#install.packages("rpart")
library(rpart)
library(caret)
set.seed(10)
data(GermanCredit)
GermanCredit_subset <- GermanCredit[,1:10]
#Train an rpart model
rPartMod <- train(Class ~ ., data=GermanCredit_subset, method="rpart")
#Find variable importance
rpartImp <- varImp(rPartMod)
#Print variable importance
print(rpartImp)
The output is as follows:
rpart variable importance
                          Overall
Amount                    100.000
Duration                   89.670
Age                        75.229
ForeignWorker              22.055
InstallmentRatePercentage  17.288
Telephone                   7.813
ResidenceDuration           4.471
NumberExistingCredits       0.000
NumberPeopleMaintenance     0.000
#Plot top 5 variable importance
plot(rpartImp, top = 5, main='Variable Importance')
The output is as follows:
From the preceding plot, we can observe that Amount, Duration, and Age have high importance values.
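The scores in this output run from 0 to 100 because varImp() on a caret train object rescales importance by default. The sketch below assumes a plain min-max rescaling, which is our understanding of that default; the raw scores and variable names are hypothetical and are not the values rpart produced for this model.

```r
# Hypothetical raw importance scores, for illustration only
raw_importance <- c(var_a = 12.5, var_b = 8.0, var_c = 2.0, var_d = 2.0)
# Min-max rescaling: the largest score maps to 100 and the smallest to 0
scaled_importance <- (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance)) * 100
print(round(scaled_importance, 3))
```

Under this assumption, a score of 0 means that a variable was the least important in the model, not necessarily that it carried no information. Passing scale = FALSE to varImp() returns the unscaled values.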