Practical Predictive Analytics

By Ralph Winters

Other helpful tools


Man does not live by bread alone, so it would behoove you to learn additional tools beyond R in order to advance your analytic skills:

  • SQL: SQL is a valuable tool to know, regardless of which language/package/environment you choose to work in. Virtually every analytics tool has a SQL interface, and knowing how to optimize SQL queries will speed up your productivity, especially if you do a lot of data extraction directly from a SQL database. Current thinking favors doing as much preprocessing as possible within the database, so if you will be extracting heavily from databases such as MySQL, PostgreSQL, Oracle, or Teradata, it is worth learning how queries are optimized within their native frameworks. In the R language, there are several SQL packages that are useful for interfacing with external databases. We will be using sqldf, a popular R package that lets you run SQL queries against R dataframes (see the short sketch after this list). Other packages are tailored to the specific database you will be working with.
  • Web extraction tools: Not every data source will originate from a data warehouse. Knowledge of tools and APIs that extract data from the internet is valuable. Popular options in R include the curl and jsonlite packages (a jsonlite sketch also follows this list).
  • Spreadsheets: Despite their problems, spreadsheets are often the fastest way to do quick data analysis and, more importantly, they enable you to share your results with others! R offers several interfaces to spreadsheets but, again, standalone spreadsheet skills such as pivot tables and Visual Basic for Applications (VBA) will give you an advantage if you work for corporations in which these skills are heavily used.
  • Data visualization tools: Data visualization tools are great for adding impact to an analysis and for concisely encapsulating complex information. Native R visualization tools are great, but not every company will be using R. Learn some third-party visualization tools such as D3.js, Google Charts, QlikView, or Tableau.
  • Big data, Spark, Hadoop, NoSQL databases: It is becoming increasingly important to know a little about these technologies, at least from the viewpoint of having to extract and analyze data that resides within these frameworks. Many software packages have APIs that talk directly to Hadoop and can run predictive analytics within the native environment, or extract the data and perform the analytics locally.
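
As a quick illustration of the sqldf bullet above, here is a minimal sketch that runs a SQL query directly against the built-in mtcars dataframe:

    # A minimal sketch, assuming sqldf has been installed
    # via install.packages("sqldf")
    library(sqldf)

    # Treat the built-in mtcars dataframe as a SQL table and aggregate it
    sqldf("SELECT cyl, AVG(mpg) AS avg_mpg
           FROM mtcars
           GROUP BY cyl")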

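Similarly, here is a sketch of web extraction with the jsonlite package; the URL is a hypothetical placeholder standing in for any JSON API:

    # A minimal sketch, assuming the jsonlite package is installed.
    # The URL below is purely hypothetical.
    library(jsonlite)

    # fromJSON() downloads and parses JSON, returning a dataframe when
    # the payload is an array of uniform records
    sales <- fromJSON("https://example.com/api/sales.json")
    str(sales)
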
Past the basics

Given that the predictive analytics space is so huge, once you are past the basics, ask yourself what area of predictive analytics really interests you and what you would like to specialize in. Learning all you can about predictive analytics is good at the beginning, but ultimately you will be called upon because you are an expert in certain industries or techniques. This could be research, algorithm development, or even managing analytics teams.

Data analytics/research

As general guidance, if you are involved in, or oriented toward, the data analytics or research side of data science, I would suggest concentrating on data mining methodologies and the specific data modeling techniques that are heavily prevalent in the industries that interest you.

For example, logistic regression is heavily used in the insurance industry, but social network analysis is not. Economic research is geared toward time series analysis, but not so much cluster analysis. Recommender engines are prevalent in online retail.

Data engineering

If you are involved more on the data engineering side, concentrate more on data cleaning, being able to integrate various data sources, and the tools needed to accomplish this.

Management

If you are a manager, concentrate on model development, testing and control, metadata, and presenting results to upper management in order to demonstrate value or return on investment.

Team data science

Of course, predictive analytics is becoming more of a team sport, rather than a solo endeavor, and the data science team is very much alive. There is a lot that has been written about the components of a data science team, much of which can be reduced to the three basic skills that I outlined earlier.

Two different ways to look at predictive analytics

Various industries interpret the goals of predictive analytics differently. For example, social science and marketing practitioners like to understand the factors that go into a model, and can sacrifice a bit of accuracy if the model can be explained well enough. On the other hand, the builders of a black-box stock trading model are more interested in minimizing the number of bad trades; at the end of the day they tally up the gains and losses without really caring which parts of the trading algorithm worked. For them, accuracy is what matters in the end.

Depending upon how you intend to approach a particular problem, consider how two different analytical mindsets can affect the predictive analytics process:

  1. Minimize prediction error goal: This is a very common use case for machine learning. The initial goal is to predict, using appropriate algorithms, in a way that minimizes the prediction error. If done incorrectly, an algorithm will ultimately fail, and it will need to be continually re-optimized to come up with the new best algorithm. If this is performed mechanically, without regard to understanding the model, it will almost certainly result in failed outcomes. Certain models, especially overoptimized ones with many variables, can have a very high prediction rate but be unstable in a variety of ways. If one does not understand the model, it can be difficult to react to changes in the data inputs (see the short sketch following this list).
  2. Understanding model goal: This grew out of the scientific method and is tied closely to the concept of hypothesis testing. Understanding is achievable in certain kinds of models, such as regression and decision trees, and is more difficult in others, such as Support Vector Machines (SVMs) and neural networks. In the understanding-model paradigm, understanding causation or impact becomes more important than optimizing correlations. Models built for understanding typically have a lower prediction rate, but they have the advantage of revealing more about the causation behind the individual parts of the model and how they are related. For example, industries that rely on understanding human behavior emphasize this goal. A limitation of this orientation is that we might tend to discard results that are not immediately understood; it takes discipline to accept a model with lower predictive ability. However, you can also gain model stability.
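
To make item 1 concrete, here is a minimal base R sketch (with illustrative simulated data) contrasting a simple, explainable model with an overoptimized one that uses many terms. The complex model fits the training data more closely but reacts erratically to a new input:

    # Simulate data whose true relationship is linear
    set.seed(123)
    x <- seq(0, 10, length.out = 30)
    y <- 2 * x + rnorm(30, sd = 3)

    simple  <- lm(y ~ x)            # understandable model
    complex <- lm(y ~ poly(x, 15))  # overoptimized model with many terms

    # The complex model scores better on the training data...
    summary(simple)$r.squared
    summary(complex)$r.squared

    # ...but extrapolates wildly on a new input, illustrating instability
    predict(simple,  newdata = data.frame(x = 10.5))
    predict(complex, newdata = data.frame(x = 10.5))

This instability on unseen inputs is one face of the bias/variance tradeoff mentioned at the end of this section.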

Of course, the previous examples illustrate two disparate approaches. Combination models, which use the best of both worlds, are the ones we should strive for. Therefore, one goal is a final model that:

  • Has an acceptable prediction error
  • Is stable over time
  • Needs a minimum of maintenance
  • Is simple enough to understand and explain

You will learn later that this is related to the bias/variance tradeoff.