
Regression Analysis with Python

By: Luca Massaron, Alberto Boschetti

Overview of this book

Regression is the process of learning relationships between inputs and continuous outputs from example data, which enables predictions for novel inputs. There are many kinds of regression algorithms, and the aim of this book is to explain which is the right one to use for each set of problems and how to prepare real-world data for it. With this book you will learn to define a simple regression problem and evaluate its performance. The book will help you understand how to properly parse a dataset, clean it, and create an output matrix optimally built for regression. You will begin with a simple regression algorithm to solve some data science problems and then progress to more complex algorithms. The book will enable you to use regression models to predict outcomes and make critical business decisions. Through the book, you will gain the knowledge to use Python for building fast, better linear models and to apply the results in Python or in any computer language you prefer.
Table of Contents (16 chapters)
Regression Analysis with Python
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Greedy selection of features


By following our experiments throughout the book, you may have noticed that adding new variables always seems to improve a linear regression model. That's especially true for training errors, and it happens not only when we insert the right variables but also when we insert the wrong ones. Puzzlingly, even when we add redundant or useless variables, there is always a more or less positive impact on the fit of the model.
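We can see this effect with a small experiment. The following sketch (the synthetic dataset, the coefficients, and the feature counts are illustrative choices, not taken from the book's examples) appends columns of pure random noise to a regression problem and shows that the training R-squared never gets worse as features are added:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))                      # three genuinely informative predictors
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

train_r2 = []
for extra in (0, 5, 20):
    noise = rng.normal(size=(n, extra))          # purely random, useless columns
    X_aug = np.hstack([X, noise])                # augment the design matrix
    r2 = LinearRegression().fit(X_aug, y).score(X_aug, y)
    train_r2.append(r2)
    print(f"{extra:2d} noise features -> training R^2 = {r2:.4f}")
```

Because ordinary least squares can always set a new coefficient to exploit chance correlations (or, at worst, to zero), the training R-squared is non-decreasing in the number of columns, even though the extra columns carry no real signal.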

The reason is easily explained: since regression models are high-bias models, they find it beneficial to augment their complexity by increasing the number of coefficients they use. Thus, some of the new coefficients can be used to fit the noise and other details present in the data. This is precisely the memorization/overfitting effect we discussed before. When you have as many coefficients as observations, your model can become saturated (that's the technical term used in statistics) and you could have a perfect prediction because basically you have...
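The saturation case can be demonstrated directly. In this sketch (the dataset sizes are illustrative assumptions), the target is pure noise with no relationship to the predictors at all, yet with as many parameters as observations the model reproduces it exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 20
X = rng.normal(size=(n, n - 1))   # n - 1 features + intercept = n parameters for n points
y = rng.normal(size=n)            # pure noise: there is no real relationship to learn

model = LinearRegression().fit(X, y)
r2_saturated = model.score(X, y)  # essentially 1.0: the model has memorized y
print(f"training R^2 on pure noise: {r2_saturated:.6f}")
```

With n parameters and n observations, the fitted hyperplane can pass through every training point, so the training error says nothing about how the model will behave on new data.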