Learning Predictive Analytics with R

By: Eric Mayor

Overview of this book

This book is packed with easy-to-follow guidelines that explain the workings of the key data mining tools of R, which are used to discover knowledge from your data. You will learn how to perform key predictive analytics tasks using R, such as training and testing predictive models for classification and regression, and scoring new data sets. All chapters guide you in acquiring these skills in a practical way. Most chapters also include a theoretical introduction that will sharpen your understanding of the subject matter and invite you to go further. The book familiarizes you with the most common data mining tools of R, such as k-means, hierarchical regression, linear regression, association rules, principal component analysis, multilevel modeling, k-NN, Naïve Bayes, decision trees, and text mining. It also describes visualization techniques using the basic visualization tools of R, as well as lattice for visualizing patterns in data organized in groups. This book is invaluable for anyone fascinated by the data mining opportunities offered by GNU R and its packages.

Exercises


Here, we provide the exercises for most chapters and recommend that you practice your newly acquired skills after reading each chapter.

Chapter 1 – Setting GNU R for Predictive Modeling

Chapter 1 already contains the exercises and solutions.

Chapter 2 – Visualizing and Manipulating Data Using R

Have a look at the following exercises and try to perform the required tasks.

Let's have a little fun now! For this exercise, imagine a player betting on red for 1,000 consecutive trials. You'll have to plot the variations in the player's money throughout the game. Use the isRed attribute of the data frame Data that we built at the beginning of the chapter. The player starts with $1,000 and bets $1 in every game. The worst possible outcome is leaving with nothing, but also without debts. The line graph has to be wide rather than tall, that is, 10 x 4 inches (consult the documentation of the par() function to see how to configure this). Does the player end up winning or losing money?
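One way to approach this exercise can be sketched as follows. The sketch does not use the Data frame from the chapter: it simulates a hypothetical is_red vector instead, assuming a European wheel where 18 of the 37 pockets are red. It also sets the 10 x 4 inch size on a PDF device rather than through par(), which is a swapped-in technique chosen only to keep the example self-contained.

```r
# Hypothetical stand-in for Data$isRed: 1,000 simulated spins,
# assuming 18 red pockets out of 37 (European roulette).
set.seed(1)
is_red <- sample(c(TRUE, FALSE), 1000, replace = TRUE, prob = c(18, 19) / 37)

# The player wins $1 on red and loses $1 otherwise, starting with $1,000.
# With 1,000 bets of $1, the balance can never drop below zero.
balance <- 1000 + cumsum(ifelse(is_red, 1, -1))

# Draw the line graph on a device that is 10 inches wide and 4 inches tall.
out_file <- file.path(tempdir(), "roulette_balance.pdf")
pdf(out_file, width = 10, height = 4)
plot(balance, type = "l", xlab = "Spin", ylab = "Money ($)")
abline(h = 1000, lty = 2)  # starting amount, for reference
dev.off()

final_balance <- balance[length(balance)]
```

Comparing final_balance with 1000 answers the question for this particular simulated game; with the chapter's Data$isRed in place of is_red, the same lines apply.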

Chapter 3 – Data Visualization with Lattice

Here, simply plot the relationship between Petal.Length and Petal.Width in the iris dataset and include the regression line.
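A minimal sketch of such a plot, assuming the lattice package is installed: in xyplot(), the type argument accepts "p" for points and "r" for a fitted regression line. Putting Petal.Width on the y axis is an arbitrary choice made here.

```r
library(lattice)

# Scatterplot of petal width against petal length,
# with points ("p") and a linear regression line ("r").
p <- xyplot(Petal.Width ~ Petal.Length, data = iris,
            type = c("p", "r"),
            xlab = "Petal length", ylab = "Petal width")
print(p)
```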

Chapter 4 – Cluster Analysis

Here, simply determine the best number of clusters in the iris dataset (omit the Species attribute), using several distance measures (use distance = "euclidean", distance = "maximum", and distance = "manhattan"). Always use method = "kmeans". What is the best number of clusters for each distance (use a majority rule)? Do the results surprise you?

Chapter 5 – Agglomerative Clustering Using hclust()

Use hclust() to perform clustering on the iris dataset (omit the Species attribute). Use different methods for distance calculation (configurable using the method argument of the dist() function) and different linkage options (configurable using the method argument of the hclust() function).
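The combinations can be explored in a loop, as in the sketch below. The choice of three distances and four linkages, the cut at three clusters, and the purity-style agreement score against Species are all assumptions made here for illustration; the exercise itself does not prescribe how to compare the solutions.

```r
x <- iris[, 1:4]  # omit the Species attribute

# All combinations of distance and linkage methods to try (illustrative subset).
results <- expand.grid(distance = c("euclidean", "maximum", "manhattan"),
                       linkage = c("complete", "average", "single", "ward.D2"),
                       stringsAsFactors = FALSE)

# For each combination, cluster, cut the tree into 3 groups, and measure
# the share of observations that fall in the majority species of their cluster.
results$agreement <- unname(mapply(function(d, l) {
  tree <- hclust(dist(x, method = d), method = l)
  groups <- cutree(tree, k = 3)
  tab <- table(groups, iris$Species)
  sum(apply(tab, 1, max)) / nrow(x)
}, results$distance, results$linkage))

# Show the combinations from highest to lowest agreement.
results[order(-results$agreement), ]
```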

Chapter 6 – Dimensionality Reduction with Principal Component Analysis

The bfi dataset (in the psych package) contains the responses of 2,800 participants to the Big Five Inventory (http://www.ocf.berkeley.edu/~johnlab/bfi.htm), which measures the five dimensions of personality. It contains 25 items of the inventory, five per dimension of personality: Neuroticism (N1-N5), Extraversion (E1-E5), Conscientiousness (C1-C5), Agreeableness (A1-A5), and Openness (O1-O5), as well as the variables gender, education, and age at the end of the data frame.

Perform the following using the 25 items:

  • Examine the missing values.

  • Perform the diagnostics (omit cases with missing values). What do you find?

  • Run PCA using princomp().

  • Plot the eigenvalues to determine the number of components to be retained.

  • Rerun the analysis with that number of components using principal() with the varimax rotation and save the PCA scores.

  • What is the proportion of cumulative variance explained by all the components?

  • Name the components by looking at the loadings.

  • What is the relationship (correlation) between each component and the age attribute?

Chapter 7 – Exploring Association Rules with Apriori

Using the ICU dataset, without the attribute race, obtain the association rules with support = 0.1, confidence = 0.8, and minlen = 2 that contain pco=<=45 as an antecedent. You should obtain 13 rules. Convert the rules object to a data frame. Create an object containing the significance values of Fisher's exact test for these rules (rounded to two decimal places), and append it as a column to the data frame you just created. Visualize the relationship between lift and the significance of Fisher's exact test (the p value) using the plot() function.

Chapter 8 – Probability Distributions, Covariance, and Correlation

Try performing the following exercises:

  1. Adapt the code we used when discussing the binomial distribution to compute the probability that the number of red outcomes in European roulette spins falls between:

    • 40 and 49 times

    • 51 and 60 times

    Are these probabilities different? Why or why not?

  2. Compute the correlation between petal length and petal width in the iris dataset using the cor.test() function. Is the correlation positive or negative? Is it significant?
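Both exercises can be sketched with base R only. The number of spins is not stated above, so n_spins = 100 is an assumption made here, as is the probability of red of 18/37 for a European wheel.

```r
# Exercise 1: binomial probabilities (n_spins is an assumed value).
n_spins <- 100
p_red <- 18 / 37

# Probability of observing red between 40 and 49 times, and between 51 and 60 times.
p_40_49 <- sum(dbinom(40:49, size = n_spins, prob = p_red))
p_51_60 <- sum(dbinom(51:60, size = n_spins, prob = p_red))
c(p_40_49 = p_40_49, p_51_60 = p_51_60)

# Exercise 2: correlation test between petal length and petal width.
ct <- cor.test(iris$Petal.Length, iris$Petal.Width)
ct$estimate  # sign and size of the correlation
ct$p.value   # significance
```

Because p_red is slightly below 0.5, the distribution is not centered on 50, which is worth keeping in mind when comparing the two intervals.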

Chapter 9 – Linear Regression

Try performing the following exercises:

  1. Using the nurses dataset, examine the effect of a work-family conflict (attribute WFC) on work satisfaction (WorkSat) in the first model called model01.

  2. Create a second model called model02, in which you include WFC and exhaustion (Exhaus) as predictors of WorkSat.

    What happens to the relationship between WFC and WorkSat?

  3. Test the relationship between WFC (predictor) and Exhaus (criterion).

  4. If it is significant, perform a Sobel test for the mediation of the relationship between WFC and WorkSat by Exhaus.
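The mechanics of the Sobel test can be sketched in base R as follows. The nurses dataset is not reproduced here, so the sketch runs on simulated data: x, m, and y are hypothetical stand-ins for WFC, Exhaus, and WorkSat, and the helper sobel_test() is a name introduced only for this illustration. The statistic is z = a*b / sqrt(b^2 * se_a^2 + a^2 * se_b^2), where a is the effect of the predictor on the mediator and b is the effect of the mediator on the outcome, controlling for the predictor.

```r
# Hypothetical helper: Sobel z statistic and two-sided p value.
sobel_test <- function(a, se_a, b, se_b) {
  z <- (a * b) / sqrt(b^2 * se_a^2 + a^2 * se_b^2)
  list(z = z, p = 2 * pnorm(-abs(z)))
}

# Simulated stand-in data (not the nurses dataset).
set.seed(42)
n <- 200
x <- rnorm(n)                      # plays the role of the predictor
m <- 0.5 * x + rnorm(n)            # plays the role of the mediator
y <- 0.5 * m + 0.2 * x + rnorm(n)  # plays the role of the outcome

# Path a: predictor -> mediator. Path b: mediator -> outcome, controlling for predictor.
path_a <- coef(summary(lm(m ~ x)))["x", c("Estimate", "Std. Error")]
path_b <- coef(summary(lm(y ~ x + m)))["m", c("Estimate", "Std. Error")]

res <- sobel_test(path_a[["Estimate"]], path_a[["Std. Error"]],
                  path_b[["Estimate"]], path_b[["Std. Error"]])
res
```

With the real data, the two lm() calls would use Exhaus ~ WFC and WorkSat ~ WFC + Exhaus instead of the simulated variables.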

Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes

In this exercise, you will try to classify the observations in the Ozone dataset using knn(). The class is the season and is computed (approximately) as follows:

library(mlbench)
data(Ozone)
Oz = na.omit(Ozone)
# Column 1 holds the month; convert it to a number once
month = as.numeric(Oz[[1]])
# Default season is winter; overwrite the other months below
Oz$season = rep("winter", nrow(Oz))
Oz$season[month >= 3 & month <= 5] = "spring"
Oz$season[month >= 6 & month <= 8] = "summer"
Oz$season[month >= 9 & month <= 11] = "autumn"

You will determine the best number of neighbors on the basis of the kappa value in the training set (higher is better). Finally, based on the kappa value in the testing set with the best number of neighbors, would you trust the classification?

The training and testing datasets are obtained as follows:

set.seed(5)
Oz$samples = sample(0:1, nrow(Oz), replace = TRUE)
TRAIN = subset(Oz, samples == 0)
TEST = subset(Oz, samples == 1)

The class (season, the target attribute) is in column 14. Do not include columns 1 and 15 in the analyses. If you select the class by subsetting, take care to unlist it (for instance, with the unlist() function); otherwise, use the df$attribute notation for the class.
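The overall workflow (compute kappa for several values of k, keep the best one, then check it on the testing set) can be sketched as below. Several things here are assumptions: the sketch uses iris as a stand-in so that it runs without mlbench, the helper cohen_kappa() is a hypothetical name written for this illustration, and the training kappa is obtained with leave-one-out cross-validation through knn.cv() from the class package, because predicting the training set from itself with k = 1 would trivially give perfect agreement.

```r
library(class)

# Hypothetical helper: Cohen's kappa from predicted and true labels.
cohen_kappa <- function(pred, truth) {
  lv <- union(levels(factor(truth)), levels(factor(pred)))
  tab <- table(factor(pred, levels = lv), factor(truth, levels = lv))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n                       # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Stand-in data: a random split of iris into training and testing sets.
set.seed(5)
in_train <- sample(c(TRUE, FALSE), nrow(iris), replace = TRUE)
train <- iris[in_train, ]
test <- iris[!in_train, ]

# Kappa in the training set for k = 1..10 (leave-one-out predictions).
ks <- 1:10
kappa_train <- sapply(ks, function(k) {
  pred <- knn.cv(train[, 1:4], cl = train$Species, k = k)
  cohen_kappa(pred, train$Species)
})
best_k <- ks[which.max(kappa_train)]

# Kappa in the testing set with the selected number of neighbors.
pred_test <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = best_k)
kappa_test <- cohen_kappa(pred_test, test$Species)
```

For the exercise itself, TRAIN and TEST replace train and test, the predictors are the columns other than 1, 14, and 15, and the class is TRAIN$season.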

Chapter 11 – Classification Trees

Classify the observations in the iris dataset (class is Species) using C4.5 (pruned tree) and CART (using the default arguments). Which produces the better classification in terms of accuracy on the testing set? Create a function that assesses accuracy.

The training and testing sets are generated as follows:

IRIStrain = iris[as.numeric(row.names(iris)) %% 2 == 1,]
IRIStest = iris[as.numeric(row.names(iris)) %% 2 == 0,]
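The CART half of the exercise, together with an accuracy function, might look like the sketch below. It assumes the rpart package is installed; the function name accuracy() is a hypothetical choice, and the C4.5 tree is left out because it requires additional software.

```r
library(rpart)

# Odd-numbered rows for training, even-numbered rows for testing.
IRIStrain <- iris[as.numeric(row.names(iris)) %% 2 == 1, ]
IRIStest <- iris[as.numeric(row.names(iris)) %% 2 == 0, ]

# Hypothetical helper: proportion of predictions that match the true class.
accuracy <- function(predicted, actual) {
  mean(as.character(predicted) == as.character(actual))
}

# CART with the default arguments, evaluated on the testing set.
cart_fit <- rpart(Species ~ ., data = IRIStrain)
cart_pred <- predict(cart_fit, newdata = IRIStest, type = "class")
cart_acc <- accuracy(cart_pred, IRIStest$Species)
cart_acc
```

The same accuracy() call can then be applied to the predictions of the C4.5 tree to compare both classifiers.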

Chapter 12 – Multilevel Analyses

Try performing the following exercises:

  • Using the NursesML dataset, visualize whether the relationship between exhaustion (attribute Exhaust) and work satisfaction (WorkSat) varies between hospitals. Include the regression line. Perform the same step for the relationship of depersonalization (Depers) and work satisfaction.

  • Using the modelPred model, determine what change in the observed work satisfaction corresponds to an increase of 1 in the predicted values.

  • What is the intercept of the model (that is, the average value of work satisfaction for the average predicted value)?

Chapter 13 – Text Analytics with R

The tm package contains a corpus of 50 news articles that we access as follows:

data(acq)
acq

Which terms occur more than 100 times in this corpus before and after preprocessing with the preprocess() function?

Using the preprocessed data, plot the sorted term frequencies above 10 with terms (row names) on the x axis. Use the barplot() function.