Regression Analysis with Python

By: Luca Massaron, Alberto Boschetti

Overview of this book

Regression is the process of learning relationships between inputs and continuous outputs from example data, which enables predictions for novel inputs. There are many kinds of regression algorithms, and the aim of this book is to explain which is the right one to use for each set of problems and how to prepare real-world data for it. With this book you will learn to define a simple regression problem and evaluate its performance. The book will help you understand how to properly parse a dataset, clean it, and create an output matrix optimally built for regression. You will begin with a simple regression algorithm to solve some data science problems and then progress to more complex algorithms. The book will enable you to use regression models to predict outcomes and make critical business decisions. Throughout the book, you will learn to use Python to build fast, better linear models and to apply the results in Python or in any computer language you prefer.

Regression Analysis with Python

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Regression – The Workhorse of Data Science

Regression analysis and data science

Python for data science

Python packages and functions for linear models

Summary

Approaching Simple Linear Regression

Defining a regression problem

Starting from the basics

Extending to linear regression

Minimizing the cost function

Summary

Multiple Regression in Action

Using multiple features

Revisiting gradient descent

Estimating feature importance

Interaction models

Polynomial regression

Summary

Logistic Regression

Defining a classification problem

Defining a probability-based approach

Revisiting gradient descent

Multiclass Logistic Regression

An example

Summary

Data Preparation

Numeric feature scaling

Qualitative feature encoding

Numeric feature transformation

Missing data

Outliers

Summary

Achieving Generalization

Checking on out-of-sample data

Greedy selection of features

Regularization optimized by grid-search

Stability selection

Summary

Online and Batch Learning

Batch learning

Online mini-batch learning

Summary

Advanced Regression Methods

Least Angle Regression

Bayesian regression

SGD classification with hinge loss

Regression trees (CART)

Bagging and boosting

Gradient Boosting Regressor with LAD

Summary

Real-world Applications for Regression Models

Downloading the datasets

A regression problem

An imbalanced and multiclass classification problem

A ranking problem

A time series problem

Summary

Index

Python packages and functions for linear models

Linear models are widespread across scientific and business applications and can be found, under different function names, in quite a number of Python packages. We have selected a few for use in this book. Among them, Statsmodels is our choice for illustrating the statistical properties of models, while Scikit-learn is the package we recommend for easily and seamlessly preparing data, building models, and deploying them. We will therefore present models built with Statsmodels exclusively to illustrate the statistical properties of linear models, and turn to Scikit-learn to demonstrate how to approach modeling from a data science point of view.

NumPy

NumPy, created by Travis Oliphant, is at the core of every analytical solution in the Python language. It provides multidimensional arrays, along with a large set of functions for performing mathematical operations on them. Arrays are blocks of data arranged along multiple dimensions that implement mathematical vectors and matrices. They are useful not just for storing data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems.

In this book, we will primarily use NumPy's linalg module; as a collection of linear algebra functions, it will help explain the nuts and bolts of the algorithms:

Website: http://www.numpy.org/
Import conventions: import numpy as np
Version at the time of print: 1.9.2
Suggested install command: pip install numpy

Tip

As a convention largely adopted by the Python community, when importing NumPy, it is suggested that you alias it as np:

import numpy as np

Import conventions also exist for the other Python packages used in the code presented in this book.
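As a minimal sketch of the kind of linear algebra that the linalg module supports, the following fits a straight line by least squares in two ways. The toy data and all variable names are assumptions made up for illustration, not examples taken from the book.

```python
import numpy as np

# Hypothetical toy data generated from y = 2 + 3*x without noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix: a column of ones (intercept) next to the feature
X = np.column_stack([np.ones_like(x), x])

# Option 1: solve the normal equations (X'X) beta = X'y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Option 2: let the least-squares solver handle the problem directly
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)  # intercept and slope from the normal equations
print(beta_lstsq)   # intercept and slope from lstsq
```

Both routes recover the same intercept and slope here; np.linalg.lstsq is usually the safer choice when the columns of the design matrix are nearly collinear.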

SciPy

An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy completes NumPy's functionality, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier transforms, and much more.

The scipy.optimize package provides several commonly used optimization algorithms; we will use it to detail how a linear model can be estimated using different optimization approaches:

Website: http://www.scipy.org/
Import conventions: import scipy as sp
Version at the time of print: 0.16.0
Suggested install command: pip install scipy
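To illustrate estimating a linear model through optimization, here is a minimal sketch that minimizes the sum of squared errors with scipy.optimize.minimize. The toy data, the squared_error helper, and the choice of the BFGS method are assumptions for illustration only.

```python
import numpy as np
from scipy import optimize  # the submodule is imported explicitly

# Hypothetical toy data generated from y = 2 + 3*x without noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x
X = np.column_stack([np.ones_like(x), x])  # intercept column + feature


def squared_error(beta):
    """Cost function: sum of squared residuals for coefficients beta."""
    residuals = y - X @ beta
    return np.sum(residuals ** 2)


# Start from all-zero coefficients and let BFGS search for the minimum
result = optimize.minimize(squared_error, x0=np.zeros(2), method="BFGS")
beta_hat = result.x

print(beta_hat)  # estimated intercept and slope
```

Because the cost function is quadratic in the coefficients, a general-purpose optimizer should land very close to the closed-form least-squares solution.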

Statsmodels

Previously part of SciKits, Statsmodels was conceived as a complement to SciPy's statistical functions. It features generalized linear models, discrete choice models, time series analysis, and a series of descriptive statistics as well as parametric and nonparametric tests.

In Statsmodels, we will use the statsmodels.api and statsmodels.formula.api modules, which provide functions for fitting linear models from either input matrices or formula specifications:

Website: http://statsmodels.sourceforge.net/
Import conventions: import statsmodels.api as sm and import statsmodels.formula.api as smf
Version at the time of print: 0.6.1
Suggested install command: pip install statsmodels

Scikit-learn

Started as part of the SciPy Toolkits (SciKits), Scikit-learn is the core of data science operations in Python. It offers all that you may need in terms of data preprocessing, supervised and unsupervised learning, model selection, validation, and error metrics. Expect us to talk at length about this package throughout the book.

Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau. Since 2013, it has been taken over by researchers at INRIA (the French Institute for Research in Computer Science and Automation).

Scikit-learn offers modules for data processing (sklearn.preprocessing and sklearn.feature_extraction), for model selection and validation (sklearn.cross_validation, sklearn.grid_search, and sklearn.metrics), and a complete set of methods (sklearn.linear_model) in which the target value, whether a number or a probability, is expected to be a linear combination of the input variables:

Website: http://scikit-learn.org/stable/
Import conventions: None; modules are usually imported separately
Version at the time of print: 0.16.1
Suggested install command: pip install scikit-learn

Tip

Note that the imported module is named sklearn.
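As a minimal sketch of how these modules fit together, the following scales the features, fits a linear model, and scores it on held-out data. The synthetic data and variable names are assumptions for illustration. The sketch also assumes a recent Scikit-learn release, where the sklearn.cross_validation and sklearn.grid_search modules named above have been merged into sklearn.model_selection; with version 0.16.1, train_test_split would be imported from sklearn.cross_validation instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split  # recent releases only
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: two features, a linear target, small noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out a quarter of the observations for out-of-sample checking
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn the scaling on the training data only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# Evaluate on the held-out data
y_pred = model.predict(scaler.transform(X_test))
mse = mean_squared_error(y_test, y_pred)
print(mse)
```

Fitting the scaler on the training split alone keeps information from the test observations out of the model, which is the point of checking on out-of-sample data.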
