Book Image

Regression Analysis with Python

By : Luca Massaron, Alberto Boschetti
4 (1)
Book Image

Regression Analysis with Python

4 (1)
By: Luca Massaron, Alberto Boschetti

Overview of this book

Regression is the process of learning relationships between inputs and continuous outputs from example data, which enables predictions for novel inputs. There are many kinds of regression algorithms, and the aim of this book is to explain which is the right one to use for each set of problems and how to prepare real-world data for it. With this book you will learn to define a simple regression problem and evaluate its performance. The book will help you understand how to properly parse a dataset, clean it, and create an output matrix optimally built for regression. You will begin with a simple regression algorithm to solve some data science problems and then progress to more complex algorithms. The book will enable you to use regression models to predict outcomes and take critical business decisions. Through the book, you will gain knowledge to use Python for building fast better linear models and to apply the results in Python or in any computer language you prefer.
Table of Contents (16 chapters)
Regression Analysis with Python
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Regression analysis and data science


Imagine you are a developer hastily working on a very cool application that is going to serve thousands of customers using your company's website everyday. Using the available information about customers in your data warehouse, your application is expected to promptly provide a pretty smart and not-so-obvious answer. The answer unfortunately cannot easily be programmatically predefined, and thus will require you to adopt a learning-from-data approach, typical of data science or predictive analytics.

In this day and age, such applications are quite frequently found assisting numerous successful ventures on the Web, for instance:

  • In the advertising business, an application delivering targeted advertisements

  • In e-commerce, a batch application filtering customers to make more relevant commercial offers or an online app recommending products to buy on the basis of ephemeral data such as navigation records

  • In the credit or insurance business, an application selecting whether to proceed with online inquiries from users, basing its judgment on their credit rating and past relationship with the company

There are numerous other possible examples, given the constantly growing number of use cases about machine learning applied to business problems. The core idea of all these applications is that you don't need to program how your application should behave, but you just set some desired behaviors by providing useful examples. The application will learn by itself what to do in any circumstance.

After you are clear about the purpose of your application and decide to use the learning-from-data approach, you are confident that you don't have to reinvent the wheel. Therefore, you jump into reading tutorials and documentation about data science and machine learning solutions applied to problems similar to yours (they could be papers, online blogs, or books talking about data science, machine learning, statistical learning, and predictive analytics).

After reading a few pages, you will surely be exposed to the wonders of many complex machine learning algorithms you likely have never heard of before.

However, you start being puzzled. It isn't simply because of the underlying complex mathematics; it is mostly because of the large amount of possible solutions based on very different techniques. You also often notice the complete lack of any discussion about how to deploy such algorithms in a production environment and whether they would scale up to real-time server requests.

At this point, you are completely unsure where should you start. This is when this book will come to your rescue.

Let's start from the beginning.

Exploring the promise of data science

Given a more interconnected world and the growing availability of data, data science has become quite a hot topic in recent years.

In the past, analytical solutions had strong constrains: the availability of data. Useful data was generally scarce and always costly to obtain and store. Given the current data explosion, now abundant and cheaper information at hand makes learning from data a reality, thus opening the doors to a wide range of predictive applications that were simply impractical before.

In addition, being in an interconnected world, most of your customers are now reachable (and susceptible of being influenced) through the Internet or through mobile devices. This simply means that being smart in developing automated solutions based on data and its predictive powers can directly and almost instantaneously affect how your business works and performs. Being able to reach your customers instantly everywhere, 24 hours a day, 365 days a year, enables your company to turn data into profits, if you know the right things to be done. In the 21st century, data is the new oil of the digital economy, as a memorable and still undisputed article on Wired stated not too long ago (http://www.wired.com/insights/2014/07/data-new-oil-digital-economy/).However, as with oil, data has to be extracted, refined, and distributed.

Being at the intersection of substantive expertise (knowing how to do business and make profits), machine learning (learning from data), and hacking skills (integrating various systems and data sources), data science promises to find the mix of tools to leverage your available data and turn it into profits.

However, there's another side to the coin.

The challenge

Unfortunately, there are quite a few challenging issues in applying data science to a business problem:

  • Being able to process unstructured data or data that has been modeled for completely different purposes

  • Figuring out how to extract such data from heterogeneous sources and integrate it in a timely manner

  • Learning (from data) some effective general rules allowing you to correctly predict your problem

  • Understanding what has been learned and being able to effectively communicate your solution to a non-technical managerial audience

  • Scaling to real-time predictions given big data inputs

The first two points are mainly problems that require data manipulation skills, but from the third point onwards, we really need a data science approach to solve the problem.

The data science approach, based on machine learning, requires careful testing of different algorithms, estimating their predictive capabilities with respect to the problem, and finally selecting the best one to implement. This is exactly what the science in data science means: coming up with various different hypotheses and experimenting with them to find the one that best fits the problem and allows generalization of the results.

Unfortunately, there is no white unicorn in data science; there is no single hypothesis that can successfully fit all the available problems. In other words, we say that there is no free lunch (the name of a famous theorem from the optimization domain), meaning that there are no algorithms or procedures in data science that can always assure you the best results; each algorithm can be less or more successful, depending on the problem.

Data comes in all shapes and forms and reflects the complexity of the world we live in. The existing algorithms should have certain sophistication in order to deal with the complexity of the world, but don't forget that they are just models. Models are nothing but simplifications and approximations of the system of rules and laws we want to successfully represent and replicate for predictive reasons since you can control only what you can measure, as Lord Kelvin said. An approximation should be evaluated based on its effectiveness, and the efficacy of learning algorithms applied to real problems is dictated by so many factors (type of problem, data quality, data quantity, and so on) that you really cannot tell in advance what will work and what won't. Under such premises, you always want to test the simpler solutions first, and follow the principle of Occam's razor as much as possible, favoring simpler models against more complex ones when their performances are comparable.

Sometimes, even when the situation allows the introduction of more complex and more performant models, other factors may still favor the adoption of simpler yet less performant solutions. In fact, the best model is not always necessarily the most performant one. Depending on the problem and the context of application, issues such as ease of implementation in production systems, scalability to growing volumes of data, and performance in live settings, may deeply redefine how important the role of predictive performance is in the choice of the best solution.

In such situations, it is still advisable to use simpler, well-tuned models or easily explainable ones, if they provide an acceptable solution to the problem.

The linear models

In your initial overview of the problem of what machine learning algorithm to use, you may have also stumbled upon linear models, namely linear regression and logistic regression. They both have been presented as basic tools, building blocks of a more sophisticated knowledge that you should achieve before hoping to obtain the best results.

Linear models have been known and studied by scholars and practitioners for a long time. Before being promptly adopted into data science, linear models were always among the basic statistical models to start with in predictive analytics and data mining. They also have been a prominent and relevant tool part of the body of knowledge of statistics, economics, and many other quantitative subjects.

By a simple check (via a query from an online bookstore, from a library, or just from Google Books—https://books.google.com/), you will discover there is quite a vast availability of publications about linear regression. There is also quite an abundance of publications about logistic regression, and about other different variants of the regression algorithm, the so-called generalized linear models, adapted in their formulation to face and solve more complex problems.

As practitioners ourselves, we are well aware of the limits of linear models. However, we cannot ignore their strong positive key points: simplicity and efficacy. We also cannot ignore that linear models are indeed among the most used learning algorithms in applied data science, making them real workhorses in data analysis (in business as well as in many scientific domains).

Far from being the best tool at hand, they are always a good starting point in a data science path of discovery because they don't require hacking with too many parameters and they are very fast to train. Thus, linear models can point out the predictive power of your data at hand, identify the most important variables, and allow you to quickly test useful transformations of your data before applying more complex algorithms.

In the course of this book, you will learn how to build prototypes based on linear regression models, keeping your data treatment and handling pipeline prompt for possible development reiterations of the initial linear model into more powerful and complex ones, such as neural networks or support vector machines.

Moreover, you will learn that you maybe don't even need more complex models, sometimes. If you are really working with lots of data, after having certain volumes of input data feed into a model, using simple or complex algorithms won't matter all that much anymore. They will all perform to the best of their capabilities.

The capability of big data to make even simpler models as effective as a complex one has been pointed out by a famous paper co-authored by Alon Halevy, Peter Norvig, and Fernando Pereira from Google about The Unreasonable Effectiveness of Data (http://static.googleusercontent.com/media/research.google.com/it//pubs/archive/35179.pdf). Before that, the idea was already been known because of a less popular scientific paper by Microsoft researchers, Michele Banko and Eric Brill, Scaling to Very Very Large Corpora for Natural Language Disambiguation (http://ucrel.lancs.ac.uk/acl/P/P01/P01-1005.pdf).

In simple and short words, the algorithm with more data wins most of the time over other algorithms (no matter their complexity); in such a case, it could well be a linear model.

However, linear models can be also helpful downstream in the data science process and not just upstream. As they are fast to train, they are also fast to be deployed and you do not need coding complex algorithms to do so, allowing you to write the solution in any script or programming language you like, from SQL to JavaScript, from Python to C/C++.

Given their ease of implementation, it is not even unusual that, after building complex solutions using neural networks or ensembles, such solutions are reverse-engineered to find a way to make them available in production as a linear model and achieve a simpler and scalable implementation.

What you are going to find in the book

In the following pages, the book will explain algorithms as well as their implementation in Python to solve practical real-world problems.

Linear models can be counted among supervised algorithms, which are those algorithms that can formulate predictions on numbers and classes if previously given some correct examples to learn from. Thanks to a series of examples, you will immediately distinguish if a problem could be tractable using this algorithm or not.

Given the statistical origins of the linear models family, we cannot neglect starting from a statistical perspective. After contextualizing the usage of linear models, we will provide all the essential elements for understanding on what statistical basis and for what purpose the algorithm has been created. We will use Python to evaluate the statistical outputs of a linear model, providing information about the different statistical tests used.

The data science approach is quite practical (to solve a problem for its business impact), and many limitations of the statistical versions of linear models actually do not apply. However, knowing how the R-squared coefficient works or being able to evaluate the residuals of a regression or highlighting the collinearity of its predictors, can provide you with more means to obtain good results from your work in regression modeling.

Starting from regression models involving a single predictive variable, we will move on to consider multiple variables, and from predicting just numbers we will progress to estimating the probability of there being a certain class among two or many.

We will particularly emphasize how to prepare data, both the target variable (a number or a class) to be predicted and the predictors; variables contributing to a correct prediction. No matter what your data is made of, numbers, nouns, text, images, or sounds, we will provide you with the method to correctly prepare your data and transform it in such a way that your models will perform the best.

You will also be introduced to the scientific methodology at the very foundations of data science, which will help you understand why the data science approach is not just simply theoretical but also quite practical, since it allows obtaining models that can really work when applied to real-world problems.

The last pages of the book will cover some of the more advanced techniques for handling big data and complexity in models. We will also provide you with a few examples from relevant business domains and offer plenty of details about how to proceed to build a linear model, validate it, and later on implement it into a production environment.