This is a different kind of predictive analytics book. My original intention was to introduce predictive analytics techniques targeted towards legacy analytics folks, using open source tools.

However, I soon realized that there were certain aspects of legacy analytics tools that could benefit the new generation of data scientists. Having worked a large part of my career in enterprise data solutions, I was interested in writing about some different kinds of topics, such as analytics methodologies, agile, metadata, SQL analytics, and reproducible research, which are often neglected in data science/predictive analytics books, but are still critical to the success of an analytics project.

I also wanted to write about some underrepresented analytics techniques that extend beyond standard regression and classification tasks, such as using survival analysis to predict customer churn, and using market basket analysis as a recommendation engine.

Since there is a lot of movement towards cloud-based solutions, I thought it was important to include some chapters on cloud based analytics (big data), so I included several chapters on developing predictive analytics solutions within a Spark environment.

Whatever your orientation is, a key point of this book is collaboration, and I hope that regardless of your definition of data science, predictive analytics, big data, or even a benign term such as forecasting, you will find something here that suits your needs.

Furthermore, I wanted to pay homage to the domain expert as part of the data science team. These analysts are often not given fancy titles, but business analysts can make the difference between a successful analytics project and one that falls flat on its face. Hopefully, some of the topics I discuss will strike a chord with them and get them more interested in some of the technical concepts of predictive analytics.

When I was asked by Packt to write a book about predictive analytics, I first wondered what would be a good open source language to bridge the gap between legacy analytics and today's data scientist world. I thought about this considerably, since each language brings its own nuances in terms of how solutions to problems are expressed. However, I decided ultimately not to sweat the details, since predictive analytics concepts are not language-dependent, and the choice of language often is determined by personal preference as well as what is in use within the company in which you work.

I chose the R language because my background is in statistics, and I felt that R had good statistical rigor, now has reasonable integration with proprietary software such as SAS, and also integrates well with relational database systems and web protocols. It also has an excellent plotting and visualization system, and, along with its many good user-contributed packages, covers most statistical and predictive analytics tasks.

Regarding statistics, I suggest that you learn as much statistics as you can. Knowing statistics can help you separate good models from bad, and help you identify many problems in bad data just by understanding basic concepts such as measures of central tendencies (mean, median, mode), hypothesis testing, p-values, and effect sizes. It will also help you shy away from merely running a package in an automated way, and help you look a little at what is under the hood.
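As a quick illustration of how these basic concepts can flag bad data, here is a small base R sketch (the data and numbers here are simulated purely for illustration, not taken from the book's datasets):

```r
# Simulated revenue figures with a few extreme outliers mixed in
set.seed(42)
revenue <- c(rnorm(97, mean = 100, sd = 10), 5000, 6000, 7000)

# A large gap between the mean and the median hints at skew or bad values
mean(revenue)    # pulled far upward by the three outliers
median(revenue)  # stays near the typical value of 100

# A simple hypothesis test: is the true mean different from 100?
t.test(revenue, mu = 100)$p.value
```

Looking at more than one measure of central tendency, rather than trusting a single summary statistic, is exactly the kind of "under the hood" check that separates good models from bad.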

One downside to R is that it processes data in memory, which can limit the size of the datasets you can analyze on a single PC. For the datasets we use in this book, there should be no problems running R on a single PC. If you are interested in analyzing big data, I do spend several chapters discussing R and Spark within a cloud environment, in which you can process very large datasets that are distributed across many different computers.

Speaking of the datasets used in this book, I did not want to use the same datasets that you see analyzed repeatedly. Some of these datasets are excellent for demonstrating techniques, but I wanted some alternatives. However, I did not see many that I thought would be useful for this book: some were from unknown sources, some needed formal permission to use, and some lacked a good data dictionary. So, for many chapters, I ended up generating my own data using simulation techniques in R. I believe that was a good choice, since it enabled me to introduce some data generating techniques that you can use in your own work.
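To give a flavor of this kind of simulation, here is a minimal base R sketch that generates a small fake customer file (the field names, segments, and distributions are invented for illustration only):

```r
set.seed(123)
n <- 1000

# A simulated customer table: an id, a marketing segment, and a spend amount
customers <- data.frame(
  id      = 1:n,
  segment = sample(c("new", "loyal", "at_risk"), n, replace = TRUE,
                   prob = c(0.5, 0.3, 0.2)),
  spend   = round(rlnorm(n, meanlog = 4, sdlog = 0.5), 2)  # right-skewed spend
)

head(customers)
summary(customers$spend)
```

Because the random seed is fixed, the "data" is fully reproducible, which is one of the advantages of simulated datasets for teaching.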

The data I used covers a good spectrum of marketing, retail and healthcare applications. I also would have liked to include some financial predictive analytics use cases but ran out of time. Maybe I will leave that for another book!

Chapter 1, *Getting Started with Predictive Analytics*, begins with a little bit of history of how predictive analytics developed. We then discuss some different roles of predictive analytics practitioners, and describe the industries in which they work. Ways to organize predictive analytics projects on a PC are discussed next, the R language is introduced, and we end the chapter with a short example of a predictive model.

Chapter 2, *The Modeling Process*, discusses how the development of predictive models can be organized into a series of stages, each with different goals, such as exploration and problem definition, leading to the actual development of a predictive model. We discuss two important analytics methodologies, CRISP-DM and SEMMA. Code examples are sprinkled through the chapter to demonstrate some of the ideas central to the methodologies, so you will, hopefully, never be bored.

Chapter 3, *Inputting and Exploring Data*, introduces various ways that you can bring your own input data into R. We also discuss various data preparation techniques using standard SQL functions as well as analogous methods using the R dplyr package. Have no data to input? No problem. We will show you how to generate your own human-like data using the R package wakefield.

Chapter 4, *Introduction to Regression Algorithms*, begins with a discussion of supervised versus unsupervised algorithms. The rest of the chapter concentrates on regression algorithms, which represent the supervised algorithm category. You will learn about interpreting regression output such as model coefficients and residual plots. There is even an interactive game that tests whether you can determine if a series of residuals is random or not.

Chapter 5, *Introduction to Decision Trees, Clustering, and SVM*, concentrates on three other core predictive algorithms that have widespread use and, along with regression, can be used to solve many, if not most, of your predictive analytics problems. The last algorithm discussed, the Support Vector Machine (SVM), is often used with high-dimensional data, such as unstructured text, so we will accompany this example with some text mining techniques using some customer complaint comments.

Chapter 6, *Using Survival Analysis to Predict and Analyze Customer Churn*, discusses a specific modeling technique known as survival analysis and follows a hypothetical customer marketing satisfaction and retention example. We will also delve more deeply into simulating customer choice using some sampling functions available in R.
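As a small preview of the technique, fitting a survival (retention) curve takes only a few lines in R using the survival package; the churn data below is simulated for illustration and is not the book's dataset:

```r
library(survival)  # a recommended package that ships with most R installations

# Simulated churn data: months each customer was observed, and whether they churned
set.seed(1)
months  <- rexp(200, rate = 0.05)            # time until churn (or censoring)
churned <- rbinom(200, size = 1, prob = 0.7) # 1 = churned, 0 = still active (censored)

# Kaplan-Meier estimate of the survival (retention) curve
fit <- survfit(Surv(months, churned) ~ 1)
summary(fit, times = c(6, 12, 24))           # estimated retention at 6, 12, 24 months
```

The key idea, developed fully in the chapter, is that survival analysis handles censored observations (customers who have not churned yet) correctly, which ordinary regression does not.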

Chapter 7, *Using Market Basket Analysis as a Recommender Engine*, introduces the concept of association rules and market basket analysis, and steps you through some techniques that can predict future purchases based upon various combinations of previous purchases from an online retail store. It also introduces some text analytics techniques coupled with some cluster analysis that places various customers into different segments. You will learn some additional data cleaning techniques, and learn how to generate some interesting association plots.

Chapter 8, *Exploring Health Care Enrollment Data as a Time Series*, introduces time series analytics. Healthcare enrollment data from the CMS website is first explored. Then we move on to defining some basic time series concepts such as simple and exponential moving averages. Finally, we work with the R forecast package which, as its name implies, helps you to perform some time series forecasting.

Chapter 9, *Introduction to Spark Using R*, introduces SparkR, which is an environment for accessing large Spark clusters using R. No local version of R needs to be installed. It also introduces Databricks, which is a cloud-based environment for running R (as well as Python, SQL, and other languages) against Spark-based big data. This chapter also demonstrates techniques for transforming small datasets into larger Spark clusters using the Pima Indians Diabetes database as reference.

Chapter 10, *Exploring Large Datasets Using Spark*, shows how to perform some exploratory data analysis using a combination of SparkR and Spark SQL with the Pima Indians Diabetes data loaded into Spark. We will learn the basics of exploring Spark data using some Spark-specific commands that allow us to filter, group, summarize, and visualize our Spark data.

Chapter 11, *Spark Machine Learning Regression and Cluster Models*, covers machine learning by first illustrating a logistic regression model that has been built using a Spark cluster. We will learn how to split Spark data into training and test data in Spark, run a logistic regression model, and then evaluate its performance.

Chapter 12, *Spark Models - Rules-Based Learning*, teaches you how to run decision tree models in Spark using the Stop and Frisk dataset. You will learn how to overcome some of the algorithmic limitations of the Spark MLlib environment by extracting some cluster samples to your local machine and then running some non-Spark algorithms with which you are already familiar. This chapter will also introduce you to a new rule-based algorithm, OneR, and will demonstrate how you can mix different languages together in Spark, such as R, SQL, and even Python code, in the same notebook using the %magic directive.

This is neither an introductory predictive analytics book, nor an introductory book for learning R or Spark. Some knowledge of base R data manipulation techniques is expected. Some prior knowledge of predictive analytics is useful. As mentioned earlier, knowledge of basic statistical concepts such as hypothesis testing, correlation, means, standard deviations, and p-values will also help you navigate this book.

This book is for those who have already had an introduction to R, and are looking to learn how to develop enterprise predictive analytics solutions. Additionally, traditional business analysts and managers who wish to extend their skills into predictive analytics using open source R may find the book useful. Existing predictive analytic practitioners who know another language, or those who wish to learn about analytics using Spark, will also find the chapters on Spark and R beneficial.

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows:

"Save all output to the `/PracticalPredictiveAnalytics/Outputs`

directory."

A block of code is set as follows:

```
#run the model
model <- OneR(train_data, frisked ~ ., verbose = TRUE)
#summarize the model
summary(model)
#run the sql function from the SparkR package
SparkR::sql("SELECT sample_bin, count(*) FROM out_tbl GROUP BY sample_bin")
```

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

```
#note we are specifying the SparkR filter, not the dplyr filter
head(SparkR::filter(out_sd1, out_sd1$sample_bin==1), 1000)
```

Any command-line input or output (including commands at the R console) is written as follows:

**> summary(xchurn)**

**New terms** and **important words** are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Clicking the **Next** button moves you to the next screen."

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail `[email protected]`, and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

- Log in or register to our website using your e-mail address and password.
- Hover the mouse pointer on the `SUPPORT` tab at the top.
- Click on `Code Downloads & Errata`.
- Enter the name of the book in the `Search` box.
- Select the book for which you're looking to download the code files.
- Choose from the drop-down menu where you purchased this book from.
- Click on `Code Download`.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

- WinRAR / 7-Zip for Windows
- Zipeg / iZip / UnRarX for Mac
- 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Predictive-Analytics. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PracticalPredictiveAnalytics_ColorImages.pdf.

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the **Errata Submission Form** link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the **Errata** section.

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at `[email protected]` with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

If you have a problem with any aspect of this book, you can contact us at `[email protected]`, and we will do our best to address the problem.