Practical Predictive Analytics

By: Ralph Winters

Overview of this book

This is the go-to book for anyone interested in the steps needed to develop predictive analytics solutions, with examples from the worlds of marketing, healthcare, and retail. We'll get started with a brief history of predictive analytics and learn about the different roles and functions people play within a predictive analytics project. Then, we will learn about various ways of installing R along with their pros and cons, combined with a step-by-step installation of RStudio and a description of the best practices for organizing your projects. On completing the installation, we will begin to acquire the skills necessary to input, clean, and prepare your data for modeling. We will learn the six specific steps needed to implement and successfully deploy a predictive model, starting from asking the right questions, through model development, and ending with deploying your predictive model into production. We will learn why collaboration is important and how agile, iterative modeling cycles can increase your chances of developing and deploying the best possible model. We will continue the journey in the cloud, extending your skill set by learning about Databricks and SparkR, which allow you to develop predictive models on many gigabytes of data.

R packages

An R package extends the functionality of base R. Base R, by itself, is very capable, and you can do an incredible amount of analytics without adding any packages. However, adding a package may be beneficial if it adds functionality that does not exist in base R, improves or builds upon existing functionality, or just makes something that you can already do easier.

For example, there are no built-in packages in base R that enable you to perform certain types of machine learning (such as Random Forests). As a result, you need to search for an add-on package that provides this functionality. Fortunately, you are covered: there are many packages available that implement this algorithm.
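Before searching for an add-on package, it can help to check what is already on your machine. The following is a minimal sketch that is not taken from the book: the package names in pkgs are only examples, and requireNamespace() reports whether each one can be loaded, without attaching it to your session.

```r
# Example package names only; swap in whatever you are looking for
pkgs <- c("stats", "randomForest", "stargazer")

# TRUE if the package is installed and loadable, FALSE otherwise
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(available)
```

Any package reported as FALSE would need to be installed with install.packages() before you can use it.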

Bear in mind that there are always new packages coming out. I tend to favor packages that have been on CRAN for a long time and have a large user base. When installing something new, I will try to check its results against other packages that do similar things. Speed is another reason to consider adopting a new package.

The stargazer package

For an example of a package that can just make life easier, let's first consider the output produced by running the summary() function on the regression results, as we did previously. You can run it again if you wish.

summary(lm_output)

The amount of statistical information output by the summary() function can be overwhelming to the uninitiated. This is due not only to the amount of output, but also to its formatting. That is why I did not show the entire output in the previous example.

One way to make output easier to look at is to first reduce the amount of output that is presented, and then reformat it so it is easier on the eyes.
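Base R alone can already trim the output somewhat. The sketch below is an illustration rather than the book's own code: it assumes a stand-in model fitted on the built-in women dataset (the name fit and the choice of dataset are assumptions), and it keeps only the coefficient table and R-squared from the summary.

```r
# Stand-in model; the built-in `women` dataset is an assumption here
fit <- lm(weight ~ height, data = women)

# Keep only the coefficient table from the full summary
coef_table <- coef(summary(fit))
print(round(coef_table, 3))

# Pull out a single goodness-of-fit statistic
r_squared <- summary(fit)$r.squared
print(round(r_squared, 3))
```

This reduces the volume of output, but it does little for the formatting.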

To accomplish this, we can utilize a package called stargazer, which will reformat the large volume of output produced by the summary() function and simplify the presentation. Stargazer excels at reformatting the output of many regression models and displaying the results as HTML, LaTeX (which can then be compiled to PDF), or simple formatted text. By default, it will show you the most important statistical output for various models, and you can always specify the types of statistical output that you want to see.

To obtain more information on the stargazer package, you can go to CRAN and search for its documentation, and/or you can use the R help system.

If you have already installed stargazer, you can use the following command:

packageDescription("stargazer")

If you haven't installed the package, information about stargazer (or other packages) can also be found using R-specific internet searches:

RSiteSearch("stargazer")
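The same lookup can be narrowed to a few fields, which is handy when you only want the version number. This is a small sketch under one assumption: it uses the stats package, which ships with R, so that it runs even if stargazer has not been installed yet.

```r
pkg <- "stats"  # swap in "stargazer" once it has been installed

# Request only selected fields from the package's DESCRIPTION file
desc <- packageDescription(pkg, fields = c("Package", "Version", "Title"))
print(desc)

# The installed version as a plain character string
version <- as.character(packageVersion(pkg))
print(version)
```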

If you like searching for documentation within R, you can obtain more information about the R help system at:

https://www.r-project.org/help.html

Installing the stargazer package

Now, on to installing stargazer:

First create a new R script (File | New File | R Script).
Enter the following lines and then select Source from the menu bar in the code pane, which will submit the entire script:

        install.packages("stargazer")         # download and install from CRAN
        library(stargazer)                    # load the package into the session
        stargazer(lm_output, type="text")     # print a condensed text table

After the script has been run, the output appears in the Console.

Code description

Here is a line-by-line description of the code that you have just run:

install.packages("stargazer"): This line installs the package to the default package library on your machine. If you will be rerunning this code, you can comment out this line, since the package will already have been installed in your R library.
library(stargazer): Installing a package does not automatically make it available. You need to run the library() (or require()) function in order to actually load the stargazer package.
stargazer(lm_output, type="text"): This line takes the output object lm_output that was created in the first script, condenses the output, and writes it to the console in a simpler, more readable format. The stargazer package has many other options, including ones that format the output as HTML or LaTeX.
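As an alternative to commenting out the install line by hand, the installation can be made conditional. The helper below, ensure_package(), is a hypothetical name introduced for illustration and is not part of the book's script or of base R. It is demonstrated on stats, which ships with R, so that nothing is downloaded.

```r
# Hypothetical helper (not from the book): install a package only if missing
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs an internet connection and a CRAN mirror
  }
  # Report whether the package is usable after the attempt
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" is already present, so no installation is attempted
ok <- ensure_package("stats")
print(ok)
```

In your own script you would call ensure_package("stargazer") and then library(stargazer) as before.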

Please refer to the reference manual at https://cran.r-project.org/web/packages/stargazer/index.html for more information.

The reformatted results will appear in the R Console. As you can see, the output written to the console is much cleaner and easier to read.
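To give a feel for the other options, here is a hedged sketch that limits the statistics shown and also writes an HTML version to a file. It assumes a stand-in model on the built-in women dataset instead of lm_output, the title text and file name are made up for illustration, and the stargazer calls are skipped if the package is not installed.

```r
# Stand-in for lm_output; the `women` dataset is an assumption here
fit <- lm(weight ~ height, data = women)

has_stargazer <- requireNamespace("stargazer", quietly = TRUE)
if (has_stargazer) {
  # Text table restricted to a few summary statistics
  stargazer::stargazer(fit, type = "text",
                       title = "Weight regressed on height",
                       keep.stat = c("n", "rsq", "adj.rsq"))

  # Same model as HTML, written to a temporary file (also echoed to the console)
  html_file <- file.path(tempdir(), "regression_table.html")
  stargazer::stargazer(fit, type = "html", out = html_file)
} else {
  message("stargazer is not installed; skipping the formatted tables")
}
```

The keep.stat argument controls which model statistics are reported, and out names a file to which the table is written.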

Saving your work

After you are done, select File | Save from the menu bar.

Then navigate to the PracticalPredictiveAnalytics/Outputs folder that was created, name the file Chapter1_LinearRegressionOutput, and press Save.
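The menu steps above save the script itself. If you also want a copy of the model output on disk, one possible approach is sketched below. It is not from the book: the model is a stand-in fitted on the built-in women dataset, the .txt extension is an assumption, and a temporary directory is used in place of the PracticalPredictiveAnalytics/Outputs folder so that the example runs anywhere.

```r
# Stand-in model; replace `fit` with lm_output in your own session
fit <- lm(weight ~ height, data = women)

# Temporary stand-in for the PracticalPredictiveAnalytics/Outputs folder
out_dir <- file.path(tempdir(), "Outputs")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Write the printed summary to a plain text file
out_file <- file.path(out_dir, "Chapter1_LinearRegressionOutput.txt")
capture.output(summary(fit), file = out_file)
print(out_file)
```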
