Regression Analysis with Python

By Luca Massaron and Alberto Boschetti

Python for data science


Given the availability of many useful packages for creating linear models, and given that it is a programming language quite popular among developers, Python is our language of choice for all the code presented in this book.

Created in 1991 as a general-purpose, interpreted, object-oriented language, Python has slowly and steadily conquered the scientific community and grown into a mature ecosystem of specialized packages for data processing and analysis. It allows you to run countless fast experiments, develop theories easily, and promptly deploy scientific applications.

As a developer, you will find using Python interesting for various reasons:

  • It offers a large, mature system of packages for data analysis and machine learning. It guarantees that you will get all that you need in the course of a data analysis, and sometimes even more.

  • It is very versatile. No matter what your programming background or style is (object-oriented or procedural), you will enjoy programming with Python.

  • If you don't know it yet but are familiar with other languages such as C/C++ or Java, it is very simple to learn and use. After you grasp the basics, there's no better way to learn more than by immediately starting to code.

  • It is cross-platform; your solutions will work perfectly and smoothly on Windows, Linux, and Mac OS systems. You won't have to worry about portability.

  • Although interpreted, it is undoubtedly fast compared to other mainstream data analysis languages such as R and MATLAB (though it is not comparable to C, Java, and the newly emerged Julia language).

  • There are packages that allow you to call other platforms, such as R and Julia, outsourcing some of the computations to them and improving your script performance. Moreover, there are also static compilers such as Cython, which translates Python code into C, and just-in-time compilers such as PyPy, both of which can deliver higher performance.

  • It can work better than other platforms with in-memory data because of its minimal memory footprint and excellent memory management. The memory garbage collector will often save the day when you load, transform, dice, slice, save, or discard data using the various iterations and reiterations of data wrangling.

Installing Python

As a first step, we are going to create a fully working data science environment you can use to replicate and test the examples in the book and prototype your own models.

No matter in what language you are going to develop your application, Python will provide an easy way to access your data, build your model from it, and extract the right parameters you need to make predictions in a production environment.

Python is an open source, object-oriented, cross-platform programming language that, compared with its direct competitors (for instance, C/C++ and Java), produces very concise and very readable code. It allows you to build a working software prototype in a very short time, to maintain it easily, and to scale it to larger quantities of data. It has become the most used language in the data scientist's toolbox because it is a general-purpose language made very flexible thanks to a large variety of available packages that can easily and rapidly help you solve a wide spectrum of both common and niche problems.

Choosing between Python 2 and Python 3

Before starting, it is important to know that there are two main branches of Python: version 2 and version 3. Since many core functionalities have changed, scripts built for one version are often incompatible with the other (they won't work without raising errors and warnings). Although the third version is the newest, the older one is still the most used version in the scientific community, and the default version for many operating systems (mainly for compatibility during upgrades). When version 3 was released in 2008, most scientific packages weren't ready, so the scientific community stuck with the previous version. Fortunately, since then, almost all packages have been updated, leaving just a few orphans without Python 3 compatibility (see http://py3readiness.org/ for a compatibility overview).

In this book, which addresses a large audience of developers, we agreed that it would be better to work with Python 3 rather than the older version. Python 3 is the future of Python; in fact, it is the only version that will be further developed and improved by the Python Software Foundation, and it will be the default version in the future. If you are currently working with version 2 and prefer to keep working with it, we suggest you run the following few lines of code every time you start the interpreter. By doing so, you'll render Python 2 capable of executing most version 3 code with minimal or no problems at all (after installing the future package using the command pip install future, the code will patch just a few basic incompatibilities and let you safely run all the code in this book):

from __future__ import unicode_literals  # to make all string literals unicode strings
from __future__ import print_function    # to use print() as a function
from __future__ import division          # true division
from __future__ import absolute_import   # flexible imports
from six import reraise as raise_        # to raise exceptions with a traceback

Tip

The from __future__ import commands must always occur at the very beginning of your script, before any other import, or Python will report an error.
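
As a quick sanity check, the following minimal sketch shows what those imports change when run on a Python 2 interpreter (on Python 3, these behaviors are already the default):

print(7 / 2)      # 3.5: true division instead of integer division
print('a', 'b')   # print is a function, so multiple arguments print on one line
print(type(''))   # string literals are now unicode strings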

Step-by-step installation

If you have never used Python (but that doesn't mean that you may not already have it installed on your machine), you need to first download the installer from the main website of the project, https://www.python.org/downloads/ (remember, we are using version 3), and then install it on your local machine.

This section gives you full control over what is installed on your machine. This is very useful when you are going to use Python as both your prototyping and production language. Furthermore, it helps you keep track of the versions of the packages you are using. Anyway, please be warned that a step-by-step installation takes time and effort. Instead, installing a ready-made scientific distribution lessens the burden of the installation procedure and may facilitate initial learning because it saves you quite a lot of time, though it installs a large number of packages (most of which you may never use) on your computer all at once. Therefore, if you want to start immediately and don't need to control your installation, just skip this part and proceed to the next section about scientific distributions.

As Python is a multiplatform programming language, you'll find installers for computers that either run on Windows or Linux/Unix-like operating systems. Please remember that some Linux distributions (such as Ubuntu) already have Python packed in the repository, which makes the installation process even easier:

  1. Open a Python shell by typing python in the terminal or by clicking on the Python IDLE icon. Then, to test the installation, run the following code in the Python interactive shell or REPL:

    >>> import sys
    >>> print(sys.version)
    

    Tip

    Downloading the example code

    You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

If a syntax error is raised, it means that you are running Python 2 instead of Python 3. Otherwise, if you don't experience an error and you can read that your Python version is 3.x (at the time of writing this book, the latest version was 3.5.0), then congratulations on running the version of Python we selected for this book.
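
If you prefer a programmatic check rather than reading the version string, the following minimal sketch works in the interactive shell; sys.version_info supports direct tuple comparison, so the expression prints True only on Python 3.4 or later:

>>> import sys
>>> sys.version_info >= (3, 4)
True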

To clarify, when a command is given in the terminal command line, we prefix the command with $>. Otherwise, if it's for the Python REPL, it's preceded by >>>.

Installing packages

Depending on your system and past installations, Python may not come bundled with all you need, unless you have installed a distribution (which, on the other hand, is usually stuffed with much more than you may need).

To install any packages you need, you can use the commands pip or easy_install; however, easy_install is going to be dropped in the future and pip has important advantages over it. It is preferable to install everything using pip because:

  • It is the preferred package manager for Python 3 and, starting with Python 2.7.9 and Python 3.4, it is included by default with the Python binary installers

  • It provides an uninstall functionality

  • It rolls back and leaves your system clear if, for whatever reason, the package installation fails

The command pip runs on the command line and makes the process of installing, upgrading, and removing Python packages simply a breeze.

As we mentioned, if you're running at least Python 2.7.9 or Python 3.4, the pip command should already be there. To verify whether it has been installed on your local machine, run the following command and check whether any error is raised:

$> pip -V

In some Linux and Mac installations, the command is present as pip3 (more likely if you have both Python 2 and 3 on your machine), so, if you received an error when looking for pip, also try running the following:

$> pip3 -V

Alternatively, you can also test if the old command easy_install is available:

$> easy_install --version

Tip

Using easy_install despite pip's advantages makes sense if you are working on Windows, because pip will not install binary packages (it will try to build them from source); therefore, if you are experiencing unexpected difficulties installing a package, easy_install can save the day.

If your test ends with an error, you really need to install pip from scratch (and in doing so, also easy_install at the same time).

To install pip, simply follow the instructions given at https://pip.pypa.io/en/stable/installing/. The safest way is to download the get-pip.py script from https://bootstrap.pypa.io/get-pip.py and then run it using the following:

$> python get-pip.py

By the way, the script will also install setuptools from https://pypi.python.org/pypi/setuptools, which contains easy_install.

As an alternative, if you are running a Debian/Ubuntu Unix-like system, then a fast shortcut would be to install everything using apt-get:

$> sudo apt-get install python3-pip

After checking this basic requirement, you're now ready to install all the packages you need to run the examples provided in this book. To install a generic package, <pk>, you just need to run the following command:

$> pip install <pk>

Alternatively, if you prefer to use easy_install, you can also run the following command:

$> easy_install <pk>

After that, the <pk> package and all its dependencies will be downloaded and installed.

If you are not sure whether a library has been installed or not, just try to import a module from it. If the Python interpreter raises an ImportError, it can be concluded that the package has not been installed.

Let's take an example. This is what happens when the NumPy library has been installed:

>>> import numpy

This is what happens if it is not installed:

>>> import numpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named numpy

In the latter case, before importing it, you'll need to install it through pip or easy_install.
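
If you want to automate this check instead of reading a traceback, a minimal defensive-import sketch (not part of the book's code) looks like this:

try:
    import numpy
    print('NumPy is installed, version', numpy.__version__)
except ImportError:
    print('NumPy is missing: install it with "pip install numpy" and retry')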

Take care that you don't confuse packages with modules. With pip, you install a package; in Python, you import a module. Sometimes, the package and the module have the same name, but in many cases they don't match. For example, the sklearn module is included in the package named Scikit-learn.
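
To make the naming mismatch concrete, this is how the Scikit-learn case plays out (the first command runs in the terminal, the import in the Python shell):

$> pip install scikit-learn
>>> import sklearn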

Package upgrades

More often than not, you will find yourself in a situation where you have to upgrade a package because the new version either is required by a dependency or has additional features that you would like to use. To do so, first check the version of the library you have installed by glancing at the __version__ attribute, as shown in the following example using the NumPy package:

>>> import numpy
>>> numpy.__version__ # 2 underscores before and after
'1.9.2'

Now, if you want to update it to a newer release, say the 1.10.1 version, you can run the following command from the command line:

$> pip install -U numpy==1.10.1

Alternatively, but we do not recommend it unless it proves necessary, you can also use the following command:

$> easy_install --upgrade numpy==1.10.1

Finally, if you are just interested in upgrading it to the latest available version, simply run the following command:

$> pip install -U numpy

Alternatively, you can run the easy_install equivalent:

$> easy_install --upgrade numpy
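
If you are not sure which of your installed packages have newer releases available, a reasonably recent pip can list them all for you:

$> pip list --outdated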

Scientific distributions

As you've read so far, creating a working environment is a time-consuming operation for a data scientist. You first need to install Python and then, one by one, install all the libraries that you will need (sometimes, the installation procedures may not go as smoothly as you'd hoped).

If you want to save time and effort and want to ensure that you have a working Python environment that is ready to use, you can just download, install, and use a scientific Python distribution. Apart from Python itself, distributions also include a variety of preinstalled packages, and sometimes they even have additional tools and an IDE set up for your usage. A few of them are very well known among data scientists and, in the sections that follow, you will find some of the key features of the distributions that we found most useful and practical.

To immediately focus on the contents of the book, we suggest that you first download and install a scientific distribution, such as Anaconda (which is the most complete one around, in our opinion). Then, after practicing the examples in the book, you can decide whether to fully uninstall the distribution and set up Python alone, accompanied by just the packages you need for your projects.

Again, if possible, download and install the version containing Python 3.

The first package that we would recommend you try is Anaconda (https://www.continuum.io/downloads), a Python distribution offered by Continuum Analytics that includes nearly 200 packages, including NumPy, SciPy, Pandas, IPython, Matplotlib, Scikit-learn, and Statsmodels. It's a cross-platform distribution that can be installed on machines with other existing Python distributions and versions, and its base version is free. Additional add-ons that contain advanced features are charged separately. Anaconda introduces conda, a binary package manager, as a command-line tool to manage your package installations. As stated on its website, Anaconda's goal is to provide an enterprise-ready Python distribution for large-scale processing, predictive analytics, and scientific computing.

As a second suggestion, if you are working on Windows, WinPython (http://winpython.sourceforge.net) could be a quite interesting alternative (sorry, no Linux or Mac OS versions). WinPython is also a free, open source Python distribution maintained by the community. It is designed with scientists in mind, and it includes many essential packages such as NumPy, SciPy, Matplotlib, and IPython (the same as Anaconda's). It also includes Spyder as an IDE, which can be helpful if you have experience using the MATLAB language and interface. A crucial advantage is that it is portable (you can put it into any directory, or even on a USB flash drive, without the need for any administrative elevation). Using WinPython, you can have different versions present on your computer, move a version from one Windows computer to another, and easily replace an older version with a newer one just by replacing its directory. When you run WinPython or its shell, it will automatically set all the environment variables necessary for running Python as if it were regularly installed and registered on your system.

Finally, another good choice for a distribution that works on Windows could be Python(x,y). Python(x,y) (http://python-xy.github.io) is a free, open source Python distribution maintained by the scientific community. It includes a number of packages, such as NumPy, SciPy, NetworkX, IPython, and Scikit-learn. It also features Spyder, the interactive development environment inspired by the MATLAB IDE.

Introducing Jupyter or IPython

IPython was initiated in 2001 as a free project by Fernando Perez. It addressed a lack in the Python stack for scientific investigations. The author felt that Python lacked a user programming interface that could incorporate the scientific approach (mainly meaning experimenting and interactively discovering) in the process of software development.

A scientific approach implies fast experimentation with different hypotheses in a reproducible fashion (as do data exploration and analysis tasks in data science), and when using IPython you will be able to more naturally implement an explorative, iterative, trial-and-error research strategy in your code writing.

Recently, a large part of the IPython project has been moved to a new project called Jupyter (http://jupyter.org). This new project extends the potential usability of the original IPython interface to a wide range of programming languages. For a complete list of available kernels, please visit https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages.

You can use the same IPython-like interface and interactive programming style no matter what language you are developing in, thanks to the powerful idea of kernels, which are programs that run the user's code, as communicated by the frontend interface; they then provide feedback on the results of the executed code to the interface itself.

IPython (the Python kernel is kernel zero, the original starting point) can be simply described as a tool for interactive tasks operable through a console or a web-based notebook, offering special commands that help developers better understand and build the code currently being written.

In contrast to an IDE, which is built around the idea of writing a script, running it afterwards, and finally evaluating its results, IPython lets you write your code in chunks, run each of them sequentially, and evaluate the results of each one separately, examining both textual and graphical outputs. Besides graphical integration, it provides further help thanks to customizable commands, a rich history (in JSON format), and computational parallelism for enhanced performance when dealing with heavy numeric computations.

In IPython, you can easily combine code, comments, formulas, charts, interactive plots, and rich media such as images and videos, making it a complete scientific sketchpad for all your experiments and their results. Moreover, IPython supports reproducible research, allowing any data analysis and model building to be recreated easily under different circumstances.

IPython works in your favorite browser (which could be Explorer, Firefox, or Chrome, for instance) and, when started, presents a cell waiting for code to be written in it. Each block of code enclosed in a cell can be run, and its results are reported in the space just after the cell. Plots can be rendered in the notebook (inline plots) or in a separate window. In our example, we decided to plot our chart inline.

Notes can be written easily using the Markdown language, a very easy and accessible markup language (http://daringfireball.net/projects/markdown).

Such an approach is also particularly fruitful for tasks involving developing code based on data, since it automatically accomplishes the often-neglected duty of documenting and illustrating how data analysis has been done, as well as its premises, assumptions, and intermediate/final results. If part of your job is also to present your work and attract internal or external stakeholders to the project, IPython can really perform the magic of storytelling for you with little additional effort. On the web page https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks, there are many examples, some of which you may find inspiring for your work as we did.

Actually, we have to confess that keeping a clean, up-to-date IPython Notebook has saved us uncountable times when meetings with managers/stakeholders have suddenly popped up, requiring us to hastily present the state of our work.

As an additional resource, IPython offers a complete library of magic commands that allow you to execute useful actions such as measuring the time it takes for a command to execute, or creating a text file with the output of a cell. We distinguish between line magics and cell magics, depending on whether they operate on a single line of code or on the code contained in an entire cell. For instance, the magic command %timeit measures the time it takes to execute the command on the same line as the magic, whereas %%time is a cell magic that measures the execution time of an entire cell.
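
For instance, here is a minimal sketch you could type into two separate IPython cells (the reported timings will naturally depend on your machine):

%timeit sum(range(1000))   # line magic: times only this statement

%%time
squares = [x ** 2 for x in range(100000)]   # cell magic: times the whole cell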

If you want to explore more about magic commands, just type %quickref into an IPython cell and run it: a complete guide will appear to illustrate all available commands.

In short, IPython lets you:

  • See intermediate (debugging) results for each step of the analysis

  • Run only some sections (or cells) of the code

  • Store intermediate results in JSON format and have the ability to version-control them

  • Present your work (this will be a combination of text, code, and images), share it via the IPython Notebook Viewer service (http://nbviewer.ipython.org/), and easily export it into HTML, PDF, or even slide shows

IPython is our favored choice throughout this book, and it is used to clearly and effectively illustrate operations with scripts and data and their consequent results.

Note

For a complete treatise on the full range of IPython functionalities, please refer to the two Packt Publishing books IPython Interactive Computing and Visualization Cookbook, Cyrille Rossant, Packt Publishing, September 25 2014, and Learning IPython for Interactive Computing and Data Visualization, Cyrille Rossant, Packt Publishing, April 25 2013.

For our illustrative purposes, just consider that every IPython block of instructions has a numbered input statement and an output one, so you will find the code presented in this book structured in two blocks, at least when the output is not trivial; otherwise, expect only the input part:

In:  <the code you have to enter>
Out: <the output you should get>

Please notice that we do not number the inputs or the outputs.
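
For example, a trivial interaction would be rendered as follows:

In:  print('Hello, Python')
Out: Hello, Python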

Though we strongly recommend using IPython, if you are using a REPL approach or an IDE interface, you can use the same instructions and expect identical results (except for print formats and extensions of the returned results).