Learning Predictive Analytics with Python

By: Ashish Kumar, Gary Dougan

Overview of this book

Social media and the Internet of Things have resulted in an avalanche of data. Data is powerful, but not in its raw form: it needs to be processed and modeled, and Python is one of the most robust tools out there to do so. It has an array of packages for predictive modeling and a suite of IDEs to choose from. Learning to predict who will win, lose, buy, lie, or die with Python is an indispensable skill set to have in this data age. This book is your guide to getting started with predictive analytics using Python. You will see how to process data and build predictive models from it. We balance both statistical and mathematical concepts, and implement them in Python using libraries such as pandas, scikit-learn, and numpy. You'll start by getting an understanding of the basics of predictive modeling, then you will see how to cleanse your data of impurities and get it ready for predictive modeling. You will also learn about the most widely used predictive modeling algorithms, such as Linear Regression, Decision Trees, and Logistic Regression. Finally, you will see the best practices in predictive modeling, as well as the different applications of predictive modeling in the modern world.
Table of Contents (19 chapters)
Learning Predictive Analytics with Python
Credits
Foreword
About the Author
Acknowledgments
About the Reviewer
www.PacktPub.com
Preface
A List of Links
Index

Applications and examples of predictive modelling


In the introductory section, data was compared with oil. Oil has been the primary source of energy for the last couple of centuries, and the legends of OPEC, petrodollars, and the Gulf Wars have cast it as a fiercely contested resource; to justify the comparison, the might of data needs to be demonstrated here as well. Let us glance through some examples of predictive analytics to marvel at the might of data.

LinkedIn's "People also viewed" feature

If you are a frequent LinkedIn user, you might be familiar with LinkedIn's "People also viewed" feature.

What does it do?

Let's say you have searched for a person who works at a particular organization, and LinkedIn throws up a list of search results. You click on one of them and land on their profile. In the middle-right section of the screen, you will find a panel titled "People Also Viewed"; it is essentially a list of people who either work at the same organization as the person whose profile you are currently viewing, or who have the same designation and belong to the same industry.

Isn't it cool? If not for this feature, you might have had to search for each of these people separately. This feature increases the efficacy of your search results and saves you time.

How is it done?

Are you wondering how LinkedIn does it? The rough blueprint is as follows:

  • LinkedIn leverages its search history data to do this. The model underneath this feature plunges into a treasure trove of search history data and looks at what people searched for next after finding the person they were originally looking for.

  • This event of searching for a particular second person after searching for a particular first person has some probability, which is calculated using all the data for such searches. The profiles with the highest probability of being viewed next (based on the historical data) are shown in the "People Also Viewed" section; a sketch of this calculation follows the list.

  • This probability comes under the ambit of a broad set of rules called Association Rules. These are very widely used in retail analytics, where we are interested in knowing which groups of products tend to sell together. In other words, what is the probability of a customer buying a particular second product, given that they have already bought the first product?
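As a minimal sketch of this idea (the search log, column names, and profile labels here are invented for illustration), the conditional probability can be computed with pandas as follows:

```python
import pandas as pd

# Hypothetical search-history log (invented data): each row records the
# profile viewed first and the profile viewed immediately afterwards.
history = pd.DataFrame({
    'first_profile':  ['A', 'A', 'A', 'B', 'A', 'B'],
    'second_profile': ['B', 'C', 'B', 'C', 'B', 'A'],
})

# P(second | first): for each profile viewed first, the share of searches
# in which each other profile was viewed next.
pair_counts = history.groupby(['first_profile', 'second_profile']).size()
first_counts = history.groupby('first_profile').size()
conditional = pair_counts.div(first_counts, level='first_profile')

# The profiles with the highest probability of being viewed after 'A'
# would populate the "People Also Viewed" panel on A's profile.
print(conditional.loc['A'].sort_values(ascending=False))
```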

Correct targeting of online ads

If you browse the Internet, which I am sure you do frequently, you must have encountered online ads, both on websites and in smartphone apps. Just like ads in a newspaper or on TV, there is a publisher and an advertiser for online ads too. The publisher in this case is the website or the app where the ad will be shown, while the advertiser is the company or organization that is posting that ad.

The ultimate goal of an online ad is to be clicked on. Each instance of an ad being displayed is called an impression. The number of clicks per impression is called the Click Through Rate, and it is the single most important metric that advertisers are interested in. The problem statement is to determine the list of publishers where the advertiser should publish its ads so that the Click Through Rate is maximized.

How is it done?

  • The historical data in this case consists of information about people who visited a certain website/app and whether or not they clicked the published ad. Classification models such as Decision Trees and Support Vector Machines, or a combination of them, are used in such cases to determine whether a visitor will click on the ad, given the visitor's profile information.

  • One problem with standard classification algorithms in such cases is that Click Through Rates are very small numbers, of the order of less than 1%, so the resulting dataset used for classification has very sparse positive outcomes. The majority (non-click) class needs to be downsampled to enrich the proportion of positive outcomes before modelling, as sketched below.
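A rough sketch of the downsampling step, on a synthetic impression log with hypothetical visitor features, might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic impression log: two invented visitor-profile features and a
# rare 'clicked' outcome (~1% positives), mimicking a low Click Through Rate.
n = 100_000
ads = pd.DataFrame({
    'age': rng.integers(18, 65, n),
    'pages_viewed': rng.poisson(4, n),
    'clicked': rng.random(n) < 0.01,
})

# Downsample the non-click majority so positives are no longer swamped;
# the 1:10 positive-to-negative ratio is an arbitrary illustrative choice.
clicks = ads[ads['clicked']]
no_clicks = ads[~ads['clicked']].sample(len(clicks) * 10, random_state=0)
balanced = pd.concat([clicks, no_clicks])

# Fit a simple classifier on the enriched sample.
model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(balanced[['age', 'pages_viewed']], balanced['clicked'])
```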

Logistic regression is one of the most standard classifiers for situations with binary outcomes. In banking, for example, whether a person will default on their loan can be predicted using logistic regression, given their credit history.
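A minimal sketch of such a model, on synthetic credit histories with invented features (missed payments and credit utilization), might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic credit histories: [missed payments, credit utilization],
# with default more likely as both increase (purely illustrative).
X = np.column_stack([rng.integers(0, 6, 500), rng.random(500)])
y = (0.4 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 1, 500) > 2.5).astype(int)

clf = LogisticRegression()
clf.fit(X, y)

# Estimated probability that a new applicant with 3 missed payments and
# 80% credit utilization will default on the loan.
print(clf.predict_proba([[3, 0.8]])[0, 1])
```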

Santa Cruz predictive policing

Based on the historical data consisting of the area and time window of the occurrence of a crime, a model was developed to predict the place and time where the next crime might take place.

How is it done?

  • A decision tree model was created using the historical data. The model predicts whether a crime is likely to occur in a given area on a given date and at a given time in the future.

  • The model is recalibrated every day to include the crimes that happened during that day, as sketched after this list.
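A toy sketch of this idea, on a hypothetical crime log with invented area, hour, and day-of-week columns, might look like this:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical historical crime log: area identifier, hour of day,
# day of week, and whether a crime occurred in that window.
crimes = pd.DataFrame({
    'area_id':     [3, 3, 7, 7, 3, 7, 1, 1],
    'hour':        [22, 23, 2, 1, 21, 3, 10, 14],
    'day_of_week': [5, 6, 6, 0, 5, 6, 2, 3],
    'crime':       [1, 1, 1, 0, 1, 0, 0, 0],
})

features = ['area_id', 'hour', 'day_of_week']
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(crimes[features], crimes['crime'])

# Daily recalibration: re-fit on the log extended with that day's crimes.
new_day = pd.DataFrame({'area_id': [3], 'hour': [23],
                        'day_of_week': [6], 'crime': [1]})
crimes = pd.concat([crimes, new_day], ignore_index=True)
model.fit(crimes[features], crimes['crime'])

# Predict whether a crime is likely in area 3 at 22:00 on a Saturday.
print(model.predict(pd.DataFrame({'area_id': [3], 'hour': [22],
                                  'day_of_week': [5]})))
```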

The good news is that the police are using such techniques to predict crime scenes in advance so that they can prevent crimes from happening. The bad news is that certain terrorist organizations use similar techniques to target the locations where they can cause the maximum damage with minimal effort. The good news, again, is that this strategic behavior of terrorists has been studied in detail and is being used to shape counter-terrorism policies.

Determining the activity of a smartphone user using accelerometer data

The accelerometer in a smartphone measures acceleration over a period of time as the user engages in various activities. The acceleration is measured over three axes: X, Y, and Z. This acceleration data can then be used to determine whether the user is sleeping, walking, running, jogging, and so on.

How is it done?

  • The acceleration data is clustered based on the acceleration values along the three axes. The values for similar activities cluster together.

  • Clustering performs well in such cases if the columns contributing the most to the separation of activities are included while calculating the distance matrix for clustering. Such columns can be identified using a technique called Singular Value Decomposition, as sketched after this list.
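A minimal sketch of this approach, on synthetic three-axis readings for two invented activities, using numpy's SVD and scikit-learn's KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Synthetic accelerometer readings (X, Y, Z) for two activities: a
# low-acceleration one (e.g. sitting) and a high-acceleration one
# (e.g. running); real data would come from the phone's sensor log.
sitting = rng.normal(loc=[0.1, 0.1, 9.8], scale=0.2, size=(100, 3))
running = rng.normal(loc=[3.0, 1.5, 11.0], scale=1.5, size=(100, 3))
readings = np.vstack([sitting, running])

# Singular Value Decomposition highlights the directions (combinations
# of the X, Y, Z columns) that contribute most to separating activities.
centered = readings - readings.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ Vt[:2].T      # keep the two strongest directions

# Cluster in the reduced space; readings from similar activities fall together.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(projected)
```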

Sport and fantasy leagues

Moneyball, anyone? Yes, the movie. The movie in which a statistician turns around the fortunes of a poorly performing baseball team, the Oakland A's, by developing an algorithm to select players who were cheap to buy but had a lot of latent potential to perform.

How was it done?

  • Bill James, using historical data, concluded that the older metrics used to rate a player, such as stolen bases, runs batted in, and batting average, were not very useful indicators of a player's performance in a given match. He instead relied on metrics such as on-base percentage and slugging percentage as better predictors of a player's performance.

  • The chief statistician behind the algorithms, Bill James, compiled the performance data of all the baseball league players and sorted them by these metrics (a toy illustration follows this list). Surprisingly, the players who scored high on these metrics were also available at cheaper prices.
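As a toy illustration (the players, metric values, and salaries below are invented), sorting a player table by these metrics with pandas is enough to surface candidates who score well but come cheap:

```python
import pandas as pd

# Invented player table: on-base percentage, slugging percentage, and
# salary in millions of USD.
players = pd.DataFrame({
    'player': ['P1', 'P2', 'P3', 'P4'],
    'obp':    [0.390, 0.310, 0.370, 0.300],
    'slg':    [0.480, 0.520, 0.450, 0.400],
    'salary_million_usd': [0.8, 6.5, 1.2, 4.0],
})

# Rank by the metrics Bill James favoured; undervalued players are those
# who rank highly here yet carry a low salary.
print(players.sort_values(['obp', 'slg'], ascending=False))
```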

This way, they put together a team that didn't have individual stars who came at hefty prices, but that as a whole was an indomitable force. Since then, these algorithms and their variations have been used in a variety of real and fantasy leagues to select players. Variants of these algorithms are also being used by venture capitalists to optimize and automate their due diligence when selecting prospective start-ups to fund.