Learning Predictive Analytics with R

By: Eric Mayor

Overview of this book

This book is packed with easy-to-follow guidelines that explain how the key data mining tools of R work and how they are used to discover knowledge from your data. You will learn how to perform key predictive analytics tasks using R, such as training and testing predictive models for classification and regression tasks and scoring new data sets. Every chapter guides you in acquiring these skills in a practical way. Most chapters also include a theoretical introduction that will sharpen your understanding of the subject matter and invite you to go further. The book familiarizes you with the most common data mining tools of R, such as k-means, hierarchical clustering, linear regression, association rules, principal component analysis, multilevel modeling, k-NN, Naïve Bayes, decision trees, and text mining. It also describes visualization techniques using the basic visualization tools of R, as well as lattice for visualizing patterns in data organized in groups. This book is invaluable for anyone fascinated by the data mining opportunities offered by GNU R and its packages.

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Setting GNU R for Predictive Analytics

Installing GNU R

The R graphic user interface

The menu bar of the R console

Packages

Summary

Visualizing and Manipulating Data Using R

The roulette case

Histograms and bar plots

Scatterplots

Boxplots

Line plots

Application – Outlier detection

Formatting plots

Summary

Data Visualization with Lattice

Loading and discovering the lattice package

Discovering multipanel conditioning with xyplot()

Discovering other lattice plots

Updating graphics

Case study – exploring cancer-related deaths in the US

Summary

Cluster Analysis

Distance measures

Learning by doing – partition clustering with kmeans()

Using k-means with public datasets

Summary

Agglomerative Clustering Using hclust()

The inner working of agglomerative clustering

Agglomerative clustering with hclust()

Summary

Dimensionality Reduction with Principal Component Analysis

The inner working of Principal Component Analysis

Learning PCA in R

Summary

Exploring Association Rules with Apriori

Apriori – basic concepts

The inner working of apriori

Analyzing data with apriori in R

Summary

Probability Distributions, Covariance, and Correlation

Probability distributions

Covariance and correlation

Summary

Linear Regression

Understanding simple regression

Working with multiple regression

Analyzing data in R: correlation and regression

Robust regression

Bootstrapping

Summary

Classification with k-Nearest Neighbors and Naïve Bayes

Understanding k-NN

Working with k-NN in R

Understanding Naïve Bayes

Working with Naïve Bayes in R

Computing the performance of classification

Summary

Classification Trees

Understanding decision trees

ID3

C4.5

C5.0

Classification and regression trees and random forest

Conditional inference trees and forests

Installing the packages containing the required functions

Performing the analyses in R

Caret – a unified framework for classification

Summary

Multilevel Analyses

Nested data

Multilevel regression

Multilevel modeling in R

Predictions using multilevel models

Summary

Text Analytics with R

An introduction to text analytics

Loading the corpus

Data preparation

Creating the training and testing data frames

Classification of the reviews

Mining the news with R

Summary

Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML

Cross-validation and bootstrapping of predictive models using the caret package

Exporting models using PMML

Summary

Exercises and Solutions

Exercises

Solutions

Exercises

Here, we provide the exercises for most chapters and recommend that you practice your newly acquired skills after reading each chapter.

Chapter 1 – Setting GNU R for Predictive Analytics

Chapter 1 already contains the exercises and solutions.

Chapter 2 – Visualizing and Manipulating Data Using R

Have a look at the following exercises and try to perform the required tasks.

Let's have a little fun now! For this exercise, imagine a player betting on red for 1,000 consecutive trials. Plot the variations in the player's money throughout the game, using the isRed attribute of the data frame Data that we built at the beginning of the chapter. The player starts with $1,000 and bets $1 in every game, so the worst possible outcome is leaving with nothing, but also without debts. The line graph has to be wide rather than tall, that is, 10 x 4 inches (see the documentation of the par() function to find out how to configure this). Does the player end up winning or losing money?
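A minimal sketch of one way to approach this exercise, not the book's own solution. The Data data frame from the chapter is not reproduced here, so isRed is simulated as a stand-in, assuming a European wheel with 18 red pockets out of 37. The names gain, money, and plot_file are illustrative. A PDF device of 10 x 4 inches is used here in place of a par() setting.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for Data$isRed: 1,000 spins of a European wheel
# (18 red pockets out of 37).
isRed <- sample(c(TRUE, FALSE), 1000, replace = TRUE, prob = c(18, 19) / 37)

# The player wins $1 on red and loses $1 otherwise.
gain <- ifelse(isRed, 1, -1)

# Money held after each spin, starting from $1,000.
money <- 1000 + cumsum(gain)

# Draw the line plot on a device that is 10 inches wide and 4 inches tall.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file, width = 10, height = 4)
plot(money, type = "l", xlab = "Spin", ylab = "Money ($)")
abline(h = 1000, lty = 2)  # starting amount, for reference
invisible(dev.off())

# Final balance: above 1000 is a net win, below 1000 is a net loss.
money[1000]
```

With the real data, replace the simulated vector with Data$isRed; the rest of the sketch stays the same.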

Chapter 3 – Data Visualization with Lattice

Here, simply plot the relationship between Petal.Length and Petal.Width in the iris dataset and include the regression line.
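One possible sketch, assuming that petal width goes on the y axis (the exercise does not say which attribute is the outcome). In xyplot(), the type argument accepts "p" for points and "r" for a fitted regression line. The object name petal_plot is illustrative.

```r
library(lattice)  # ships with standard R distributions

# type = c("p", "r") draws the points and a fitted regression line
petal_plot <- xyplot(Petal.Width ~ Petal.Length, data = iris,
                     type = c("p", "r"),
                     xlab = "Petal length", ylab = "Petal width")
print(petal_plot)
```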

Chapter 4 – Cluster Analysis

Here, simply determine the best number of clusters in the iris dataset (omit the Species attribute), using several distance measures (use distance = "euclidean", distance = "maximum", and distance = "manhattan"). Always use method = "kmeans". What is the best number of clusters for each distance (use a majority rule)? Do the results surprise you?

Chapter 5 – Agglomerative Clustering Using hclust()

Use hclust() to perform clustering on the iris dataset (omit the Species attribute). Use different methods for distance calculation (configurable using the method argument of the dist() function) and different linkage options (configurable using the method argument of the hclust() function).
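The following sketch shows one way to loop over distance and linkage options. The particular distances and linkages listed are only a selection, and the choice of three groups in cutree() is an assumption made for the comparison with Species. The names iris_num, fits, and groups are illustrative.

```r
iris_num <- iris[, -5]  # drop the Species attribute

dist_methods <- c("euclidean", "manhattan", "maximum")
linkages <- c("complete", "average", "single", "ward.D2")

# Fit one tree per combination of distance and linkage
fits <- list()
for (d in dist_methods) {
  for (l in linkages) {
    fits[[paste(d, l, sep = "/")]] <- hclust(dist(iris_num, method = d), method = l)
  }
}

# Cut one tree into three groups and compare them with the species labels
groups <- cutree(fits[["euclidean/average"]], k = 3)
table(groups, iris$Species)

# Uncomment to draw the dendrogram of one solution
# plot(fits[["euclidean/average"]], labels = FALSE)
```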

Chapter 6 – Dimensionality Reduction with Principal Component Analysis

The bfi dataset (in the psych package) contains the responses of 2,800 participants to the Big Five Inventory (http://www.ocf.berkeley.edu/~johnlab/bfi.htm), which measures the five dimensions of personality. The dataset contains 25 items of the inventory, five per dimension of personality: Neuroticism (N1-N5), Extraversion (E1-E5), Conscientiousness (C1-C5), Agreeableness (A1-A5), and Openness (O1-O5), as well as the variables gender, education, and age at the end of the data frame.

Perform the following using the 25 items:

Examine the missing values.
Perform the diagnostics (omit cases with missing values). What do you find?
Run PCA using princomp().
Plot the eigenvalues to determine the number of components to be retained.
Rerun the analysis with that number of components using principal() with the varimax rotation and save the PCA scores.
What is the proportion of cumulative variance explained by all the components?
Name the components by looking at the loadings.
What is the relationship (correlation) between each component and the attribute age?

Chapter 7 – Exploring Association Rules with Apriori

Using the ICU dataset without the attribute race, obtain the association rules with support = 0.1, confidence = 0.8, and minlen = 2 that contain pco=<=45 as an antecedent. You should obtain 13 rules. Convert the rules object to a data frame. Create an object containing the significance values of Fisher's exact test for these rules (rounded to two decimal places), and append it as a column to the data frame you just created. Visualize the relationship between lift and the significance of Fisher's exact test (the p value) using the plot() function.

Chapter 8 – Probability Distributions, Covariance, and Correlation

Try performing the following exercises:

Adapt the code we used when discussing the binomial distribution to compute the probability of getting a red number in European roulette spins between:
- 40 and 49 times
- 51 and 60 times
Are these probabilities different? Why or why not?
Compute the correlation between petal length and petal width in the iris dataset using the cor.test() function. Is the correlation positive or negative? Is it significant?
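A hedged sketch of both tasks. The exercise does not state the number of spins, so 100 spins is assumed here (n_spins). The probability of red is taken as 18/37 for a European wheel. All object names are illustrative.

```r
p_red <- 18 / 37  # 18 red pockets out of 37 on a European wheel
n_spins <- 100    # assumed number of spins (not stated in the exercise)

# Probability that the number of reds falls in each range
p_40_49 <- sum(dbinom(40:49, size = n_spins, prob = p_red))
p_51_60 <- sum(dbinom(51:60, size = n_spins, prob = p_red))
c(p_40_49, p_51_60)

# Pearson correlation between petal length and petal width
ct <- cor.test(iris$Petal.Length, iris$Petal.Width)
ct$estimate  # sign and size of the correlation
ct$p.value   # significance
```

When comparing the two probabilities, keep in mind that the expected number of reds is n_spins * p_red, which is below 50 because of the green pocket.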

Chapter 9 – Linear Regression

Try performing the following exercises:

Using the nurses dataset, examine the effect of work-family conflict (attribute WFC) on work satisfaction (WorkSat) in a first model called model01.
Create a second model called model02, in which you include WFC and exhaustion (Exhaus) as predictors of WorkSat.

What happens to the relationship between WFC and WorkSat?

Test the relationship between WFC (predictor) and Exhaus (criterion).
If it is significant, perform a Sobel test for the mediation of the relationship between WFC and WorkSat by Exhaus.
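The steps above can be sketched as follows. The nurses dataset is not reproduced here, so the sketch uses simulated stand-ins x (predictor), m (mediator), and y (criterion); the coefficients used to simulate them are arbitrary. The Sobel statistic is computed by hand from the two regressions, using the common first-order formula z = ab / sqrt(b^2 * s_a^2 + a^2 * s_b^2). This is an illustration of the technique, not the book's solution.

```r
set.seed(1)  # arbitrary seed for the simulated data

# Simulated stand-ins: x (predictor), m (mediator), y (criterion)
n <- 200
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- -0.4 * m + 0.1 * x + rnorm(n)

# Path a: predictor -> mediator
fit_a <- summary(lm(m ~ x))$coefficients
# Path b: mediator -> criterion, controlling for the predictor
fit_b <- summary(lm(y ~ x + m))$coefficients

a <- fit_a["x", "Estimate"]
s_a <- fit_a["x", "Std. Error"]
b <- fit_b["m", "Estimate"]
s_b <- fit_b["m", "Std. Error"]

# Sobel z statistic and its two-sided p value
sobel_z <- (a * b) / sqrt(b^2 * s_a^2 + a^2 * s_b^2)
sobel_p <- 2 * pnorm(-abs(sobel_z))
c(z = sobel_z, p = sobel_p)
```

With the real data, x, m, and y would correspond to WFC, Exhaus, and WorkSat.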

Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes

In this exercise, you will try to classify the observations in the Ozone dataset using knn(). The class is the season and is computed (approximately) as follows:

library(mlbench)
data(Ozone)
Oz = na.omit(Ozone)
# Column 1 holds the month; convert it to a number once
month = as.numeric(Oz[[1]])
# Default to winter, then overwrite the other seasons by month range
Oz$season = rep("winter", nrow(Oz))
Oz$season[month >= 3 & month <= 5] = "spring"
Oz$season[month >= 6 & month <= 8] = "summer"
Oz$season[month >= 9 & month <= 11] = "autumn"

You will determine the best number of neighbors on the basis of the kappa value in the training set (higher is better). Finally, based on the kappa value in the testing set with the best number of neighbors, would you trust the classification?

The training and testing datasets are obtained as follows:

set.seed(5)
# Randomly assign each observation to the training (0) or testing (1) set
Oz$samples = sample(0:1, nrow(Oz), replace = TRUE)
TRAIN = subset(Oz, samples == 0)
TEST = subset(Oz, samples == 1)

The class (season, the target attribute) is in column 14. Do not include columns 1 and 15 in the analyses. Take care to unlist the class (for instance, with the unlist() function) if you select it by subsetting; otherwise, use the df$attribute notation for the class.
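Since the exercise relies on the kappa value, here is a minimal sketch of how Cohen's kappa can be computed from predicted and observed labels in base R. The helper name cohen_kappa is hypothetical (note that base R's kappa() function computes something unrelated), and the toy labels below are made up only to show the call.

```r
# Cohen's kappa from predicted and observed class labels.
# cohen_kappa is a hypothetical helper name, not a function from a package.
cohen_kappa <- function(predicted, observed) {
  lev <- union(unique(as.character(observed)), unique(as.character(predicted)))
  tab <- table(factor(predicted, levels = lev), factor(observed, levels = lev))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n                      # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Toy labels, made up only to show the call
obs  <- c("winter", "winter", "spring", "spring", "summer", "autumn")
pred <- c("winter", "spring", "spring", "spring", "summer", "autumn")
cohen_kappa(pred, obs)
```

The predictions returned by knn() for each candidate number of neighbors can be passed as the first argument, with the class attribute of the corresponding set as the second.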

Chapter 11 – Classification Trees

Classify the observations in the iris dataset (class is Species) using C4.5 (pruned tree) and CART (using the default arguments). Which produces the best classification in terms of accuracy in the testing set? Create a function that assesses accuracy.

The training and testing sets are generated as follows:

# Odd-numbered rows go to the training set, even-numbered rows to the testing set
IRIStrain = iris[as.numeric(row.names(iris)) %% 2 == 1,]
IRIStest = iris[as.numeric(row.names(iris)) %% 2 == 0,]
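A sketch of the CART half of this exercise, assuming the rpart package as the CART implementation. The accuracy() helper is a hypothetical name for the function the exercise asks you to create. The C4.5 half is not shown; one commonly used implementation is J48() in the RWeka package, which requires Java.

```r
library(rpart)  # CART implementation; ships with standard R distributions

# Odd-numbered rows for training, even-numbered rows for testing
IRIStrain <- iris[seq(1, nrow(iris), by = 2), ]
IRIStest  <- iris[seq(2, nrow(iris), by = 2), ]

# Hypothetical helper: proportion of correctly classified observations
accuracy <- function(predicted, observed) {
  mean(as.character(predicted) == as.character(observed))
}

# CART with the default arguments
cart_fit <- rpart(Species ~ ., data = IRIStrain)
cart_pred <- predict(cart_fit, IRIStest, type = "class")
accuracy(cart_pred, IRIStest$Species)
```

The same accuracy() call can then be applied to the predictions of the C4.5 tree to compare the two classifiers.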

Chapter 12 – Multilevel Analyses

Try performing the following exercises:

Using the NursesML dataset, visualize whether the relationship between exhaustion (attribute Exhaust) and work satisfaction (WorkSat) varies between hospitals. Include the regression line. Perform the same step for the relationship of depersonalization (Depers) and work satisfaction.
Using the modelPred model, determine what difference in the observed work satisfaction corresponds to an increase of 1 in the predicted values.
What is the intercept of the model (that is, the average value of work satisfaction for the average predicted value)?

Chapter 13 – Text Analytics with R

The tm package contains a corpus of 50 news articles that we access as follows:

library(tm)
data(acq)
acq

What are the terms that occur more than 100 times in this corpus before and after preprocessing with the preprocess() function?

Using the preprocessed data, plot the sorted term frequencies above 10 with terms (row names) on the x axis. Use the barplot() function.
