Mastering Predictive Analytics with Python

By: Joseph Babcock

Overview of this book

The volume, diversity, and speed of available data have never been greater. Powerful machine learning methods can unlock the value in this information by finding complex relationships and unanticipated trends. Using the Python programming language, analysts can apply these sophisticated methods to build scalable analytic applications that deliver valuable insights to their organizations. In Mastering Predictive Analytics with Python, you will learn the process of turning raw data into powerful insights. Through case studies and code examples using popular open-source Python libraries, this book illustrates the complete development process for analytic applications and shows how to quickly apply these methods to your own data to create robust and scalable prediction services. Covering a wide range of algorithms for classification, regression, and clustering, as well as cutting-edge techniques such as deep learning, this book illustrates not only how these methods work, but also how to implement them in practice. You will learn to choose the right approach for your problem and to develop engaging visualizations that bring the insights of predictive modeling to life.

Case study: targeted e-mail campaigns

In our next example, the same marketing department wants to promote new items on its website to the users who are most likely to be interested in purchasing them. A predictive model that includes features from both users and these new items is used to send each customer an e-mail listing their most probable purchases. Unlike the real-time sentiment-monitoring example, the e-mails are sent in batches, and the model takes data accumulated over a customer's whole transaction history as input, which makes this problem a better fit for batch processing.

An overview of the processes used in this example is shown in Figure 7.

Figure 7: Diagram of e-mail targeting case study

Data input and transformation

During the initial data ingestion step, customer records stored in a company's data warehouse (a relational database system) are aggregated to generate features such as the average amount spent per week, the frequency with which a customer visits the company's website, and the number of items purchased in each of several categories, such as furniture, electronics, clothing, and media. These are combined with a set of features for the items that could be promoted in the e-mail campaign, such as price, brand, and the average rating of similar items on the site. The features are constructed through a batch process that runs once per week, before the e-mails are sent to customers on Mondays.
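As a rough illustration of the aggregation step, the following sketch computes two of the customer features mentioned above from raw transaction rows. The record layout, the function name `customer_features`, and the sample data are all hypothetical; in practice this logic would more likely run as a query against the data warehouse.

```python
from collections import defaultdict

# Hypothetical transaction rows: (customer_id, purchase_date, category, amount)
transactions = [
    ("c1", "2016-01-04", "electronics", 120.0),
    ("c1", "2016-01-12", "media", 30.0),
    ("c1", "2016-01-13", "media", 10.0),
    ("c2", "2016-01-05", "furniture", 400.0),
]

def customer_features(rows, n_weeks):
    """Aggregate raw transactions into per-customer features."""
    total_spent = defaultdict(float)
    items_by_category = defaultdict(lambda: defaultdict(int))
    for customer_id, _, category, amount in rows:
        total_spent[customer_id] += amount
        items_by_category[customer_id][category] += 1
    return {
        customer_id: {
            # Average amount spent per week over the history window
            "avg_spent_per_week": total_spent[customer_id] / n_weeks,
            # Number of items purchased in each category
            "items_by_category": dict(items_by_category[customer_id]),
        }
        for customer_id in total_spent
    }

# The sample rows span two weeks of history
features = customer_features(transactions, n_weeks=2)
```

The item-level features (price, brand, average rating of similar items) would be built in a similar way and joined to these rows for each customer and item pair.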

Sanity checking

The inputs to the model are checked for reasonable values: are the average purchase behaviors or transaction volumes of a customer far outside the expected range? Such values could indicate errors in the data warehouse processing, or bot traffic on the website. Because the transformation logic involved in constructing features for the model is complex and may change over time as the model evolves, its outputs are also checked. For example, the purchase numbers and average prices should never be less than zero, and no category of merchandise should have zero records.
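The checks described above could be expressed along the following lines. This is only a sketch: the feature layout, the function name `check_features`, and the upper bound of 10,000 per week are assumptions made for illustration, not values from the case study.

```python
def check_features(features, expected_categories, max_weekly_spend=10_000.0):
    """Return a list of human-readable problems found in the feature table."""
    problems = []
    categories_seen = set()
    for customer_id, row in features.items():
        spend = row["avg_spent_per_week"]
        if spend < 0:
            problems.append(f"{customer_id}: negative average spend")
        elif spend > max_weekly_spend:  # assumed threshold, possibly bot traffic
            problems.append(f"{customer_id}: average spend outside expected range")
        for category, count in row["items_by_category"].items():
            if count < 0:
                problems.append(f"{customer_id}: negative item count for {category}")
            elif count > 0:
                categories_seen.add(category)
    # No category of merchandise should have zero records
    for category in sorted(set(expected_categories) - categories_seen):
        problems.append(f"no records for category {category}")
    return problems

# Hypothetical feature rows, two of them deliberately invalid
features = {
    "c1": {"avg_spent_per_week": 80.0, "items_by_category": {"electronics": 1, "media": 2}},
    "c2": {"avg_spent_per_week": -5.0, "items_by_category": {"furniture": 1}},
    "c3": {"avg_spent_per_week": 250_000.0, "items_by_category": {"media": 4}},
}
problems = check_features(features, ["furniture", "electronics", "clothing", "media"])
```

Returning a list of problems rather than raising on the first failure makes it easier to include every finding in a diagnostics report.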

After the candidate items are scored and before any e-mails are sent, the top-scoring items for each customer are sanity checked by comparing them either to the customer's historical transactions (to determine whether they are sensible) or, if no history is available, to the purchases of the customers most similar in demographics.
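One simple way to make this comparison concrete is to measure how many recommended items fall in categories the customer has bought from before. The sketch below assumes category overlap is an acceptable notion of "sensible"; the function name `recommendation_overlap` and all inputs are hypothetical.

```python
def recommendation_overlap(recommended, item_category, own_history, similar_history):
    """Fraction of recommended items whose category appears in the reference history.

    Falls back to the history of similar customers when the customer has none.
    Returns None when there is nothing to compare against.
    """
    reference = own_history or similar_history
    if not recommended or not reference:
        return None
    hits = sum(1 for item in recommended if item_category[item] in reference)
    return hits / len(recommended)

# Hypothetical data: category of each candidate item
item_category = {"tv": "electronics", "lamp": "furniture", "novel": "media"}
recommended = ["tv", "lamp", "novel"]

# Customer with a purchase history in two categories
with_history = recommendation_overlap(
    recommended, item_category, {"electronics", "media"}, {"furniture"}
)
# New customer: fall back to categories bought by demographically similar customers
without_history = recommendation_overlap(
    recommended, item_category, set(), {"furniture"}
)
```

A low overlap does not necessarily mean the model is wrong, but a sudden drop across many customers is a useful signal to investigate.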

Model development

In this example, the model is a random forest regression (see Chapter 4, Connecting the Dots with Models – Regression Methods) that divides historical item-customer pairs into purchases (labeled 1) and non-purchases (labeled 0) and produces a scored probability that customer A purchases item X. One complexity in this model is that items that haven't been purchased might simply not have been seen by the customer yet, so a restriction is imposed: the negative examples must be drawn from items that have already been available on the website for a month or more. The hyperparameters of this model (the number and size of the trees) are calibrated during weekly retraining, along with the influence of individual variables on the resulting predictions.
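The restriction on negative examples can be sketched as a filter applied while the labeled training pairs are built. Here "a month or more" is approximated as 30 days, which is an assumption, and the function name `build_training_pairs` and the sample data are hypothetical.

```python
from datetime import date, timedelta

def build_training_pairs(customers, items, purchases, as_of, min_days_listed=30):
    """Label customer-item pairs as purchases (1) or non-purchases (0).

    items maps item_id to the date the item was first listed on the website;
    purchases is a set of (customer_id, item_id) tuples.
    """
    cutoff = as_of - timedelta(days=min_days_listed)
    pairs = []
    for customer_id in customers:
        for item_id, listed_on in items.items():
            if (customer_id, item_id) in purchases:
                pairs.append((customer_id, item_id, 1))
            elif listed_on <= cutoff:
                # Only items listed long enough count as genuine non-purchases
                pairs.append((customer_id, item_id, 0))
            # Recently listed, unpurchased items are skipped entirely
    return pairs

items = {"sofa": date(2016, 1, 1), "tv": date(2016, 2, 20)}
purchases = {("c1", "tv")}
pairs = build_training_pairs(["c1", "c2"], items, purchases, as_of=date(2016, 3, 1))
```

The labeled pairs, joined with the customer and item features, would then be passed to a random forest implementation such as the one in scikit-learn.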

Scoring

After the model is retrained each week using historical data, the set of new items on the website is scored for each customer, and the top three are sent in the e-mail campaign.
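Selecting the top three items per customer from the model's scores is a small ranking step. The following sketch assumes the scores are already available as a nested dictionary; the function name `top_items` and the probabilities shown are made up for illustration.

```python
import heapq

def top_items(scores, k=3):
    """Return the k highest-scoring item ids for each customer, best first.

    scores maps customer_id to a dict of item_id -> predicted purchase probability.
    """
    return {
        customer_id: [
            item_id
            for item_id, _ in heapq.nlargest(
                k, item_scores.items(), key=lambda pair: pair[1]
            )
        ]
        for customer_id, item_scores in scores.items()
    }

# Hypothetical model output for two customers
scores = {
    "c1": {"lamp": 0.10, "tv": 0.72, "sofa": 0.35, "book": 0.50},
    "c2": {"tv": 0.20, "book": 0.90},
}
recommendations = top_items(scores)
```

Customers with fewer than three candidate items simply receive all of them, ordered by score.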

Visualization and reporting

Either class of sanity checking (of input data or of model performance) can be part of a regular diagnostics report on the model. Because the random forest model is more complex than other approaches, it is particularly important to monitor changes in feature importance and model accuracy, as problems may require more time to debug and resolve.

Because the predictions are used in a production system rather than delivering insights themselves, this reporting is primarily used by the analyst who developed the pipeline rather than the other members of the marketing department.

The success of these promotional e-mails will typically be monitored over the next month, and updates on the accuracy (for example, how many e-mails led to purchases above expected levels) can form the basis of a longer-term report that can help guide both the structure of the campaign itself (for example, varying the number of items in the messages) and the model (perhaps training should be performed more frequently if the predictions seem to become significantly worse between weeks).
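As one possible way to quantify "purchases above expected levels", the sketch below compares each week's observed conversion rate to a baseline rate. The baseline of 4 percent, the weekly counts, and the function name `campaign_lift` are hypothetical values chosen for illustration.

```python
def campaign_lift(emails_sent, purchases, baseline_rate):
    """Ratio of the observed conversion rate to the expected (baseline) rate."""
    observed_rate = purchases / emails_sent
    return observed_rate / baseline_rate

# Hypothetical weekly results: (week, e-mails sent, e-mails that led to a purchase)
weekly_results = [
    ("2016-W01", 1000, 50),
    ("2016-W02", 1000, 20),
]
baseline_rate = 0.04  # assumed conversion rate without targeting

report = {
    week: campaign_lift(sent, bought, baseline_rate)
    for week, sent, bought in weekly_results
}
# Weeks where targeted e-mails did worse than the baseline
underperforming = [week for week, lift in report.items() if lift < 1.0]
```

A lift that falls from one week to the next is the kind of evidence that could justify retraining the model more frequently.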

Tip

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer over the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose where you purchased this book from the drop-down menu.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
