
Solutions


You will find the solutions to the exercises below.

Chapter 1 – Setting GNU R for Predictive Modeling

Chapter 1 already contains the exercises and solutions.

Chapter 2 – Visualizing and Manipulating Data Using R

One way to solve this is to first create a vector of size 1001 (line 1). The first value is the amount of money the player starts with (assigned on line 2). On line 3, we assign a copy of the isRed attribute to the wins vector. On lines 4 to 7, we loop over the trials and compute the amount of money at each trial: we add 1 to the previous balance if the drawn number is red (line 5) and subtract 1 if it is not (line 6). On line 8, we set the graphic parameters so that the device contains a single plot (using the mfrow argument) and the plot region has the required dimensions using the pin argument (here 5 by 2 inches, width by height). Finally, we generate the plot on line 9:

1  money = rep(0,1001)
2  money[1] = 1000
3  wins = Data$isRed
4  for (i in 1:nrow(Data)) {
5     if (wins[i] == 1) money[i+1] = money[i] + 1
6     else money[i+1] = money[i] - 1
7  }
8  par (mfrow = c(1,1), pin=c(5,2))
9  plot(money, type = "l", xlab = "Trial number")

Run the code and have a look at the plot to see how the player's balance evolves over the game!
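
If you also want a quick numeric summary of the outcome, the final balance and the range of the balance over the trials can be displayed as follows (an optional check, not part of the exercise):

money[length(money)]   # the player's balance after the last trial
range(money)           # the lowest and highest balance reached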

Chapter 3 – Data Visualization with Lattice

We plot petal length against petal width, conditioned on species, and add a regression line in each panel as follows:

library(lattice)
xyplot(Petal.Length ~ Petal.Width | Species, data = iris,
   panel = function(x, y, ...) {
      panel.xyplot(x, y)
      panel.abline(lm(y ~ x))
   })
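
An equivalent way to draw the same regression lines is lattice's panel.lmline(), which fits lm(y ~ x) and adds the fitted line in each panel (this helper is also used in the Chapter 12 solution):

xyplot(Petal.Length ~ Petal.Width | Species, data = iris,
   panel = function(x, y, ...) {
      panel.xyplot(x, y)
      panel.lmline(x, y)   # fits and draws the regression line per panel
   })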

Chapter 4 – Cluster Analysis

The following code performs all the operations:

library(NbClust)
NbClust(iris[1:4], distance = "euclidean", method = "kmeans")
NbClust(iris[1:4], distance = "manhattan", method = "kmeans")
NbClust(iris[1:4], distance = "maximum", method = "kmeans")

The outputs all show that a two-cluster solution is the best according to the majority rule. This is surprising, because we know there are three classes in the dataset, and kmeans() is quite accurate at classifying the cases. Apparently, however, this is not the best partition according to these criteria. Without prior knowledge, we could have concluded that the dataset contains observations for only two species of flowers.
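
As a quick check of the last claim (an optional illustration, not part of the exercise), we can cross-tabulate a three-cluster k-means solution with the actual species; the seed and the number of random starts are arbitrary choices:

# How well does a three-cluster k-means solution recover the species?
set.seed(1234)
km = kmeans(iris[1:4], centers = 3, nstart = 25)
table(km$cluster, iris$Species)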

Chapter 5 – Agglomerative Clustering Using hclust()

We first create a vector with all the possible distances allowed by the dist() function. We then create a list called d containing one element, the first distance matrix (computed with the Euclidean distance), and add the other distance matrices to the list. Note that the binary distance is included here only for demonstration purposes; it is of no practical use with quantitative data:

dist.types = c("euclidean", "maximum", "manhattan", "canberra",
   "binary", "minkowski")
d = list(dist(iris[1:4], method = dist.types[1]))
for (i in 2:length(dist.types)) d = c(d, list(dist(iris[1:4],
   method = dist.types[i])))

Next, we create a vector with all the possible agglomeration methods allowed by hclust(). We then perform the analysis six times for each of these methods (once per distance matrix), saving the results in M, an array of lists (one list of six models per method):

agglo.types = c("ward.D", "ward.D2", "single", "complete",
   "average", "mcquitty", "median", "centroid")
M = array(list(), 8)
for (i in 1:length(agglo.types)) {
   OneDList = list(hclust(d[[1]], method = agglo.types[i]))
   for (j in 2:length(dist.types)) OneDList =
      c(OneDList, list(hclust(d[[j]], method = agglo.types[i])))
   M[[i]] = OneDList
}

The 48 hclust models stored in M could be used to select the best solution, for instance by examining their dendrograms. We can plot the result obtained with complete linkage and the Euclidean distance (graph not included) like this:

plot(M[[4]][[1]])

The reader will notice that the results obtained using the binary distance (for example, with complete linkage, as follows) are meaningless:

plot(M[[4]][[5]])
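
Rather than picking dendrograms one by one, you can also browse all 48 of them. The following loop (an optional convenience, not required by the exercise) plots, for each agglomeration method, the six dendrograms obtained with the different distances:

# Six panels (one per distance) for each agglomeration method.
for (i in 1:length(agglo.types)) {
   par(mfrow = c(2, 3))
   for (j in 1:length(dist.types))
      plot(M[[i]][[j]], labels = FALSE, xlab = "", sub = "",
         main = paste(agglo.types[i], dist.types[j]))
}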

Chapter 6 – Dimensionality Reduction with Principal Component Analysis

Here are the solutions. Before anything else, let's make sure the psych package is loaded:

library(psych)
  • The number of missing values can be displayed for all items with one line of code:

    summary(bfi[,1:25])

    We can see that there are between 9 and 36 missing values for each of the items.

    We assign the cases without missing values to a new object called Dat, retaining only the bfi measures:

    Dat = na.omit(bfi[,1:25])

    We can find out how many cases of the original dataset contain missing values using the following line of code. The answer is 364:

    dim(bfi)[1] - dim(Dat)[1]
  • We examine the results of Bartlett's test of sphericity as follows:

    cortest.normal(Dat)

    The results show that the data set is significantly different from an identity matrix. We can examine the KMO index using the following code:

    KMO(Dat)

    The results show that the overall MSA value is large (0.85), as are most individual item values (lowest at 0.76). We deduce from these tests that performing PCA on this data is warranted.

  • We run the PCA using the following line of code:

    pcas = princomp(Dat)
  • We plot the eigenvalues (the squared standard deviations) like this:

    plot(pcas$sdev^2)

    We find that a five-factor solution seems optimal, which is great, as we have five theoretical factors of personality (the questionnaire is designed that way).

  • We rerun the analysis with five factors and include the scores as follows:

    pcas2 = principal(Dat,nfactors=5, rotate="varimax", scores=T)

    We add the factorial scores to the dataset as follows:

    Dat = cbind(Dat, pcas2$scores)
  • The output of the following line of code shows that the cumulative proportion of variance is 54 percent (under Cumulative var, last column).

    pcas2
  • The component loadings also appear in the output (under Standardized loadings). We can see that the component called RC1 is mostly related to items starting with the letter E (extraversion). RC2 loads higher on items starting with N (neuroticism), RC3 on items starting with C (conscientiousness), and RC4 on items starting with O (openness). Finally, RC5 loads higher on items starting with A (agreeableness). A compact way to view this pattern is sketched below.
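
To make this loading pattern easier to read, the smaller loadings can be hidden when printing the component loadings; the 0.3 cutoff below is an arbitrary choice for this illustration, not a value prescribed by the exercise:

# Hide loadings below 0.3 and sort items by their main component.
print(pcas2$loadings, cutoff = 0.3, sort = TRUE)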

Chapter 7 – Exploring Association Rules with Apriori

You will find the solution to this exercise below.

  • Here we start from the beginning by first obtaining a dataset without the attribute in column 4 (race):

    library(vcdExtra)
    ICU_norace = ICU[-4]
  • We then load the arules package:

    library(arules)
  • We then prepare the dataset and generate the transactions:

    ICU_norace$age = cut(ICU_norace$age, breaks = 4)
    ICU_norace$systolic = cut(ICU_norace$systolic, breaks = 4)
    ICU_norace$hrtrate = cut(ICU_norace$hrtrate, breaks = 4)
    ICU_norace_tr = as(ICU_norace, "transactions")
  • Then, we generate the rules:

    rulesLowpco = apriori(ICU_norace_tr, parameter = 
       list(confidence = 0.8, support=.1, minlen = 2), 
       appearance = list(lhs = c("pco=<=45"), default="rhs")) 
  • We create a data frame from the rules as follows:

    rulesLowpco.df = as(rulesLowpco,"data.frame")
  • We create the vector of p values for Fisher's exact test as follows:

    IM = interestMeasure(rulesLowpco, "fishersExactTest", ICU_norace_tr)
  • We then append it to the data frame:

    rulesLowpco.df$IM= round(IM, digits = 2)
  • Finally, we plot the relationship between lift and the p value. We can see that the higher the lift, the lower the p value:

    plot (rulesLowpco.df$lift, rulesLowpco.df$IM, xlab = "lift", 
       ylab = "Significance")

Chapter 8 – Probability Distributions, Covariance, and Correlation

The following code will compute the probabilities of each outcome of interest:

p = 18/37
N = 100
n = 1
v40_49 = rep(1,10)
v51_60 = v40_49
i = 1
for (n in 40:49) {
   v40_49[i] = choose(N, n) * (p^n) * (1 - p)^(N-n)
   i = i + 1
}
i = 1
for (n in 51:60) {
   v51_60[i] = choose(N, n) * (p^n) * (1 - p)^(N-n)
   i = i + 1
}
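
Note that the same probabilities can be obtained more directly with R's built-in binomial mass function, dbinom(); this is only a cross-check of the sums computed below:

# P(X = n) summed over n, with 100 spins and p = 18/37.
sum(dbinom(40:49, size = 100, prob = 18/37))   # same as sum(v40_49)
sum(dbinom(51:60, size = 100, prob = 18/37))   # same as sum(v51_60)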
  • We can display the probability of obtaining between 40 and 49 red numbers by summing the individual probabilities. The result is approximately 53.5 percent:

    sum(v40_49)
  • We can do the same for the probability of obtaining between 51 and 60 red numbers. The result is approximately 34.7 percent:

    sum(v51_60)

    So it would be better to bet on black, right? Well, this wouldn't make a difference, as black also wins with probability 18/37. Because of the presence of the 0, which is neither black nor red, the probability of winning over a series of bets is lower than the probability of losing: the risk of losing each trial (19/37) is higher than the chance of winning (18/37).

  • The correlation between petal width and petal length can be computed as follows:

    cor.test(iris$Petal.Width,iris$Petal.Length)

    The correlation between these attributes is around 0.963, and is significant at p < .001.

Chapter 9 – Linear Regression

  • We examine the effect of work-family conflict (attribute WFC) on work satisfaction (WorkSat) in the first model called model01:

    model01 = lm(WorkSat ~ WFC, data = nurses)
  • We create a second model called model02, in which you include WFC and exhaustion (Exhaus) as predictors of WorkSat:

    model02 = lm(WorkSat ~ WFC + Exhaus, data = nurses)
  • We can check the relationship between WFC and WorkSat in both models using the following code. We notice that the effect of WFC on work satisfaction is significant in model01, but no longer in model02, that is, when exhaustion is included in the model:

    summary(model01)
    summary(model02)
  • Here is the code for computing model03 with WFC included as a predictor of exhaustion:

    model03 = lm(Exhaus ~ WFC, data = nurses)

    The summary of model03 shows that WFC is a significant predictor of Exhaus:

    summary(model03)
  • We therefore perform a Sobel test for the mediation of the relationship between WFC and WorkSat by Exhaus as follows:

    library(bda)
    mediation.test(nurses$Exhaus, nurses$WFC, nurses$WorkSat)

    We can see that the value for the Sobel test is significant, which attests to the presence of mediation; a by-hand check of the indirect effect is sketched below.
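
For readers who want to see where the mediated effect comes from, the indirect effect can also be computed by hand as the product of the WFC coefficient in model03 and the Exhaus coefficient in model02; this is only an illustrative check, not part of the exercise:

# Indirect effect of WFC on WorkSat through Exhaus (a * b).
a = coef(model03)["WFC"]      # path from WFC to Exhaus
b = coef(model02)["Exhaus"]   # path from Exhaus to WorkSat, controlling for WFC
a * b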

Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes

The following code will generate the kappa values for numbers of neighbors ranging from 2 to 10 (the first value is for two neighbors, the second for three, and so on):

library(psych); library(class)
kappas = rep(0,9)
for (i in 1:9) {
   set.seed(2222)
   pred.train = knn(train = TRAIN[,2:13], test = TRAIN[,2:13],
      cl = TRAIN$season, k = i+1)
   tab = table(pred.train, TRAIN[,14])
   kappas[i] = cohen.kappa(tab)[1]
}

Using the following code, we notice that the solution with two neighbors has the highest kappa:

which.max(kappas)

We therefore use this solution to predict the classes of the observations in the testing set and compute the kappa value, which is only 0.19 (that's pretty bad!):

pred.test = knn(train = TRAIN[,2:13], test = TEST[,2:13],
   cl = TRAIN$season, k = 2)
cohen.kappa(table(pred.test,TEST[,14]))[1]
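
The kappa values computed on the training set can also be visualized as a function of the number of neighbors (an optional plot, not required by the exercise); unlist() is used so that the values are treated as a plain numeric vector:

# Kappa on the training set for 2 to 10 neighbors.
plot(2:10, unlist(kappas), type = "b",
   xlab = "Number of neighbors (k)", ylab = "Cohen's kappa")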

Chapter 11 – Classification Trees

The C4.5 model is generated as follows:

library(RWeka)
IRIS.C45 = J48(Species~ . , data= IRIStrain, 
   control= Weka_control(U=FALSE))

We compute the predictions as follows:

C45.preds = predict(IRIS.C45, IRIStest)

The CART model is generated as follows:

library(rpart)
IRIS.CART = rpart(Species ~. , data= IRIStrain)

The following line of code computes a data frame with the predicted probabilities:

CART.probs = as.data.frame(predict(IRIS.CART, IRIStest))

We now obtain the column number in which the predicted probabilities are the highest for each case:

CART.preds = apply(CART.probs, 1, which.max)

Finally, we assign the name of that column to each observation:

for (i in 1:length(CART.preds)) {
   col = as.numeric(CART.preds[i])
   CART.preds[i] = names(CART.probs[col])
}
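
As a side note, the same mapping from column indices to class names can be written more compactly; this one-liner (an optional alternative, not part of the original solution) produces the same CART.preds vector:

# Pick, for each row, the name of the column with the highest probability.
CART.preds = colnames(CART.probs)[apply(CART.probs, 1, which.max)]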

Let's obtain the confusion matrix for both classifications:

CONF.C45 = table(IRIStest$Species, C45.preds)
CONF.CART = table(IRIStest$Species, CART.preds)

We create a function that computes the accuracy of a confusion matrix for us:

acc = function(conf) {
   sum(diag(conf)) / sum(conf)
}

We can see that, for this dataset, CART has a little advantage in accuracy (0.95 versus 0.92):

acc(CONF.C45)
acc(CONF.CART)

Chapter 12 – Multilevel Analyses

We obtain the graph for the relationship between Exhaust and WorkSat as follows:

library(lattice)
attach(NursesML)
xyplot(WorkSat~Exhaust | as.factor(hosp), panel = function(x,y) { 
    panel.xyplot(x,y) 
    panel.lmline(x,y)
})

We generate the second graph as follows:

xyplot(WorkSat~Depers | as.factor(hosp), panel = function(x,y) { 
     panel.xyplot(x,y) 
     panel.lmline(x,y)
})

We obtain the answers to questions 2 and 3 using the following line of code. The first value is the intercept; the second is the coefficient for the predicted value (the increase in actual work satisfaction for an increase of 1 in the predicted value):

coeffs(modelPred)

Chapter 13 – Text Analytics with R

This is easily done as follows. For the original corpus, type the following code:

DTM = as.matrix(DocumentTermMatrix(acq))
FR = colSums(DTM)
FR[FR>100]

For the preprocessed corpus, type the following code:

acq2 = preprocess(acq)
DTM2 = as.matrix(DocumentTermMatrix(acq2))
FR2 = colSums(DTM2)
FR2[FR2>100]

We obtain the graph as follows:

barplot(sort(FR2[FR2>30]), names.arg = names(sort(FR2[FR2>30])))
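
With many terms, the labels at the bottom of the bar plot can overlap. A small optional tweak is to rotate them; barplot() uses the names of the vector automatically, and las = 2 draws the labels perpendicular to the axis:

barplot(sort(FR2[FR2 > 30]), las = 2)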