Hands-On Predictive Analytics with Python

By: Alvaro Fuentes

Overview of this book

Predictive analytics is an applied field that employs a variety of quantitative methods using data to make predictions. It involves much more than just throwing data onto a computer to build a model. This book provides practical coverage to help you understand the most important concepts of predictive analytics. Using practical, step-by-step examples, we build predictive analytics solutions while using cutting-edge Python tools and packages. The book's step-by-step approach starts by defining the problem and moves on to identifying relevant data. We will also be performing data preparation, exploring and visualizing relationships, building models, tuning, evaluating, and deploying models. Each stage has relevant practical examples and efficient Python code. You will work with models such as KNN, Random Forests, and neural networks using the most important libraries in Python's data science stack: NumPy, Pandas, Matplotlib, Seaborn, Keras, Dash, and so on. In addition to hands-on code examples, you will find intuitive explanations of the inner workings of the main techniques and algorithms used in predictive analytics. By the end of this book, you will be all set to build high-performance predictive analytics solutions using Python programming.

The predictive analytics process

There is a common misunderstanding about predictive analytics: that it is all about models. In fact, building models is just one part of doing predictive analytics. Practitioners of the field have established certain standard phases, which different authors refer to by different names. However, the order of the stages is logical and the relationships between them are well understood. In fact, this book has been organized in the logical order of these stages. Here they are:

  1. Problem understanding and definition
  2. Data collection and preparation
  3. Data understanding using exploratory data analysis (EDA)
  4. Model building
  5. Model evaluation
  6. Communication and/or deployment

We will dig deeper into all of them in the following chapters. For now, let's provide a brief overview of what every stage is about. I like to think about each of these phases as having a defined goal.

Problem understanding and definition

Goal: Understand the problem and how the potential solution would look. Also, define the requirements for solving the problem.

This is the first stage in the process and a key one, because here we establish, together with the stakeholders, the objectives of the predictive model: the problem that needs to be solved and what the solution looks like from the business perspective.

In this phase, you also explicitly establish the requirements for the project. The requirements should be stated in terms of inputs: what data is needed to produce the solution, in what format it is needed, how much data is needed, and so on. You also discuss what the outputs of the analysis and predictive model will look like and how they provide solutions for the problems being discussed. We will discuss this phase in much more detail in the next chapter.

Data collection and preparation

Goal: Get a dataset that is ready for analysis.

This phase is where we take a look at the data that is available. Depending on the project, you will need to interact with the database administrators and ask them to provide you with the data. You may also need to rely on many different sources to get the data that is needed. Sometimes, the data may not exist yet and you may be part of the team that comes up with a plan to collect it. Remember, the goal of this phase is to have a dataset you will be using for building the predictive model.

In the process of getting the dataset, potential problems with the data may be identified, which is why this phase is very closely related to the previous one. While performing the tasks needed to get the dataset ready, you will go back and forth between this phase and the former one: you may realize that the available data is not enough to solve the problem as it was formulated in the problem understanding and definition phase, in which case you may need to go back to the stakeholders, discuss the situation, and perhaps reformulate the problem and the solution.

While building the dataset, you may notice some problems with some of the features. Maybe one column has a lot of missing values or the values have not been properly encoded. Although in principle it would be great to deal with problems such as missing values and outliers in this phase, that is often not the case, which is why there isn't a hard boundary between this phase and the next phase: EDA.
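
As a minimal sketch of this kind of check (the file name and the column name below are hypothetical, not taken from the book's datasets), a couple of pandas calls are often enough to surface missing values and badly encoded categories early:

    import pandas as pd

    # Load the raw dataset (hypothetical file name)
    raw = pd.read_csv("customer_data.csv")

    # How many missing values does each column have?
    print(raw.isnull().sum().sort_values(ascending=False))

    # How was a categorical column encoded? Inconsistent labels such as
    # "M", "male", and "Male" would show up here (hypothetical column name)
    print(raw["gender"].value_counts(dropna=False))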

Dataset understanding using EDA

Goal: Understand your dataset.

Once you have collected the dataset, it is time to start understanding it using EDA, a combination of numerical and visualization techniques that allows us to understand different characteristics of a dataset, its variables, and the potential relationships between them. The boundaries between this phase and the previous and next ones are often blurry. You may think that your dataset is ready for analysis, but when you start the analysis you may realize that you have five months of historical data from one source and only two months from another, or, for instance, that three features are redundant or that you need to combine some features to create a new one. So, after a few trips back to the previous phase, you may finally get your dataset ready for analysis.

Now it is time to start understanding your dataset by answering questions such as the following (a minimal code sketch follows the list):

  • What types of variables are there in the dataset?
  • What do their distributions look like?
  • Do we still have missing values?
  • Are there redundant variables?
  • What are the relationships between the features?
  • Do we observe outliers?
  • How do the different pairs of features correlate with each other?
  • Do these correlations make sense?
  • What is the relationship between the features and the target?
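
As a minimal, hypothetical sketch (the file name and the income and target column names are assumptions, not from the book), pandas, Matplotlib, and Seaborn can start answering several of these questions in a few lines:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("prepared_dataset.csv")  # hypothetical prepared dataset

    print(df.dtypes)          # what types of variables are there?
    print(df.isnull().sum())  # do we still have missing values?
    print(df.describe())      # summary statistics hint at possible outliers

    # how do the numerical features correlate with each other?
    print(df.corr(numeric_only=True))

    # distribution of a feature and its relationship with the target
    sns.histplot(df["income"])
    plt.show()
    sns.scatterplot(x="income", y="target", data=df)
    plt.show()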

All the questions that you try to answer in this phase must be guided by the goal of the project: always keep in mind the problem you are trying to solve. Once you have a good understanding of the data, you will be ready for the next phase: model building.

Model building

Goal: Produce some predictive models that solve the problem.

Here is where you build many predictive models that you will then evaluate to pick the best one. You must choose the type of model that will be trained or estimated. The term model training is associated with machine learning and the term estimation is associated with statistics. The approach, type of model, and training/estimation process you will use must be absolutely determined by the problem you are trying to solve and the solution you are looking for.

How to build models with Python and its data science ecosystem is the subject of the majority of this book. We will take a look at different approaches: machine learning, deep learning, and Bayesian statistics. After trying different approaches, types of models, and fine-tuning techniques, you will end this phase with a set of finalist models, and from the most promising ones the candidate winner will emerge: the one that produces the best solution.
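
The details come in later chapters, but as a minimal sketch of what this stage often looks like with scikit-learn (the synthetic data here is only a stand-in for a real prepared dataset), two candidate models can be trained side by side:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic data standing in for the prepared dataset
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

    # Hold out a test set for the model evaluation phase
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train a couple of candidate models: these are the "finalists"
    candidates = {
        "knn": KNeighborsClassifier(n_neighbors=5),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    }
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        print(name, "training accuracy:", model.score(X_train, y_train))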

Model evaluation

Goal: Choose the best model among a subset of the most promising ones and determine how good the model is in providing the solution.

Here is where you evaluate the subset of "finalists" to see how well they perform. Like every other stage in the process, the evaluation is determined by the problem to be solved. Usually, one or more main metrics will be used to evaluate how well the model performs. Depending on the project, other criteria may be considered when evaluating the model besides metrics, such as computational considerations, interpretability, user-friendliness, and methodology, among others. We will talk in depth about standard metrics and other considerations in Chapter 7, Model Evaluation. As with all the other stages, the criteria and metrics for model evaluation should be chosen considering the problem to be solved.
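
Continuing the hypothetical scikit-learn sketch from the model building section (so candidates, X_test, and y_test come from that snippet), a main metric such as accuracy can be computed on the held-out test set to compare the finalists:

    from sklearn.metrics import accuracy_score, classification_report

    # Evaluate each finalist on data it has never seen
    for name, model in candidates.items():
        predictions = model.predict(X_test)
        print(name, "test accuracy:", accuracy_score(y_test, predictions))
        print(classification_report(y_test, predictions))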

Please remember that the best model is not the fanciest, the most complex, the most mathematically impressive, the most computationally efficient, or the latest in the research literature: the best model is the one that solves the problem in the best possible way. So, any of the characteristics that we just talked about (fanciness, complexity, and so on) should not be considered when evaluating the model.

Communication and/or deployment

Goal: Use the predictive model and its results.

Finally, the model has been built, tested, and well evaluated: you did it! In the ideal situation, it solves the problem and its performance is great; now it is time to use it. How the model will be used depends on the project; sometimes the results and predictions will be the subject of a report and/or a presentation that will be delivered to key stakeholders, which is what we mean by communication—and, of course, good communication skills are very useful for this purpose.

Sometimes, the model will be incorporated as part of a software application, whether web, desktop, mobile, or any other type of technology. In this case, you may need to interact closely with, or even be part of, the software development team that incorporates the model into the application. There is another possibility: the model itself may become a "data product", for example, a credit scoring application that uses customer data to calculate the chance of a customer defaulting on their credit card. We will produce one example of such a data product in Chapter 9, Implementing a Model with Dash.
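
As a minimal, hypothetical sketch of this hand-off (the book's own deployment example uses Dash in Chapter 9; here joblib simply stands in for whatever persistence mechanism the application uses), the winning model from the earlier snippets can be saved to disk and reloaded wherever predictions are needed:

    import joblib

    # Persist the winning model from the earlier sketches
    joblib.dump(candidates["random_forest"], "credit_scoring_model.joblib")

    # Later, inside the application that serves predictions
    model = joblib.load("credit_scoring_model.joblib")
    new_customers = X_test[:5]  # hypothetical incoming records
    default_probabilities = model.predict_proba(new_customers)[:, 1]
    print(default_probabilities)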

Although we have enumerated the stages in order, keep in mind that this is a highly iterative, non-linear process and you will be going back and forth between these stages; the frontiers between adjacent phases are blurry and there is always some overlap between them, so it is not important to place every task under a particular phase. For instance, when dealing with outliers, is that part of the Data collection and preparation phase or of the Dataset understanding phase? In practice, it doesn't matter; you can place it wherever you want. What matters is that it gets done!

Still, knowing the logical sequence of the stages is very useful when doing predictive analytics, as it helps with preparing and organizing the work, and it helps in setting the expectations for the duration of a project. The sequence of stages is logical in the sense that a previous stage is a prerequisite for the next: for example, you can't do model evaluation without having built a model, and after evaluation you may conclude that the model is not working properly so you go back to the Model building phase and come up with another one.

CRISP-DM and other approaches

Another popular framework for doing predictive analytics is the cross-industry standard process for data mining, most commonly known by its acronym, CRISP-DM, which is very similar to what we have just described. This methodology is described in Wirth, R. & Hipp, J. (2000). In it, the process is broken into six major phases. The authors clarify that the sequence of the phases is not strict; although the arrows in their process diagram indicate the most frequent relationships between phases, these depend on the particularities of the project or the problem being solved. These are the phases of a predictive analytics project in this methodology:

  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modeling
  5. Evaluation
  6. Deployment

There are other ways to look at this process; for example, R. Peng (2016) describes the process using the concept of Epicycles of Data Analysis. For him, the epicycles are the following:

  1. Develop expectations
  2. Collect data
  3. Match expectations with the data
  4. State a question
  5. Exploratory data analysis
  6. Model building
  7. Interpretation
  8. Communication

The word epicycle is used to communicate the fact that these stages are interconnected and that they form part of a bigger wheel that is the data analysis process.