Learning Predictive Analytics with Python

By: Ashish Kumar, Gary Dougan

Overview of this book

Social media and the Internet of Things have resulted in an avalanche of data. Data is powerful, but not in its raw form: it needs to be processed and modeled, and Python is one of the most robust tools available for doing so. It has an array of packages for predictive modeling and a suite of IDEs to choose from. Learning to predict who would win, lose, buy, lie, or die with Python is an indispensable skill set to have in this data age. This book is your guide to getting started with predictive analytics using Python. You will see how to process data and build predictive models from it. We balance statistical and mathematical concepts, and implement them in Python using libraries such as pandas, scikit-learn, and NumPy. You'll start by getting an understanding of the basics of predictive modeling, then you will see how to cleanse your data of impurities and get it ready for predictive modeling. You will also learn more about the best predictive modeling algorithms, such as Linear Regression, Decision Trees, and Logistic Regression. Finally, you will see the best practices in predictive modeling, as well as the different applications of predictive modeling in the modern world.

Python and its packages for predictive modelling


In this section, we will discuss some commonly used packages for predictive modelling.

pandas: pandas is the most important and versatile package in widespread use across data science. It is no wonder that you see import pandas at the beginning of almost any data science code snippet, both in this book and elsewhere. Among other things, the pandas package facilitates:

  • Reading a dataset into a usable format (a data frame, in the case of Python)

  • Calculating basic statistics

  • Running basic operations like sub-setting a dataset, merging/concatenating two datasets, handling missing data, and so on

The various methods in pandas will be explained in this book as and when we use them.
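
As a rough preview of the three capabilities listed above, the following sketch reads two small datasets, computes a basic statistic, subsets and merges the data, and fills in a missing value. The data, column names, and variable names are hypothetical and are inlined only so that the snippet runs without an external file; this is an illustration, not the approach used later in the book.

```python
import io

import pandas as pd

# Hypothetical data, inlined so the example needs no external CSV file.
customers_csv = io.StringIO(
    "customer_id,age,city\n"
    "1,34,Delhi\n"
    "2,,Mumbai\n"
    "3,45,Delhi\n"
)
orders_csv = io.StringIO(
    "customer_id,amount\n"
    "1,250.0\n"
    "1,100.0\n"
    "3,75.5\n"
)

# Reading a dataset into a data frame.
customers = pd.read_csv(customers_csv)
orders = pd.read_csv(orders_csv)

# Basic statistics: missing values are skipped by default, so (34 + 45) / 2.
mean_age = customers["age"].mean()

# Sub-setting: keep only the rows where city is Delhi.
delhi = customers[customers["city"] == "Delhi"]

# Merging: a left join keeps every customer, even those without orders.
merged = customers.merge(orders, on="customer_id", how="left")

# Handling missing data: replace the missing age with the mean age.
customers_filled = customers.fillna({"age": mean_age})

print(mean_age)
print(merged)
```

In real use, pd.read_csv would typically be given a file path rather than an in-memory buffer.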

Note

To get an overview, navigate to the official page of pandas here: http://pandas.pydata.org/index.html

NumPy: NumPy, in many ways, is a MATLAB equivalent in the Python environment. It has powerful methods to do mathematical calculations and simulations. The following are some of its features:

  • A powerful and widely used N-dimensional (N-d) array object

  • An ensemble of powerful mathematical functions used in linear algebra, Fourier transforms, and random number generation

  • Random number generators that, combined with N-d arrays, can be used to generate dummy datasets for demonstrating various procedures, a practice we will follow extensively in this book
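
The features above can be sketched as follows: a random number generator fills an N-d array to make a dummy dataset, and a linear algebra routine then recovers the coefficients used to build it. The seed, array shapes, and coefficient values are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Seeded generator so the dummy dataset is reproducible.
rng = np.random.default_rng(seed=42)

# A 2-d array: 100 observations of 3 normally distributed features.
X = rng.normal(loc=0.0, scale=1.0, size=(100, 3))

# Hypothetical "true" coefficients used to construct the response.
true_coef = np.array([2.0, -1.0, 0.5])
noise = rng.normal(scale=0.1, size=100)
y = X @ true_coef + noise  # matrix-vector product plus small noise

# Linear algebra: ordinary least squares estimate of the coefficients.
coef_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(X.shape, y.shape)
print(coef_hat)  # should be close to true_coef
```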

Note

To get an overview, navigate to the official page of NumPy at: http://www.NumPy.org/

matplotlib: matplotlib is a Python library for generating high-quality 2-D plots with little code. Like NumPy, it will feel familiar to MATLAB users. Some of its features are as follows:

  • It can be used to draw all kinds of common plots, such as histograms, stacked and unstacked bar charts, scatterplots, heat maps, box plots, power spectra, error charts, and so on

  • It can be used to edit and manipulate all the plot properties such as title, axes properties, color, scale, and so on
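
A minimal sketch of both points, assuming dummy data generated with NumPy: it draws a histogram and a scatterplot side by side and then sets a few plot properties (titles, axis labels, colors, and axis limits). The non-interactive Agg backend is selected only so that the snippet runs without a display; that choice, like the data and the names, is an assumption made for this example.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; remove for interactive use

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=500)                 # dummy data for the histogram
x = rng.uniform(0, 10, size=100)              # dummy data for the scatterplot
y = 3 * x + rng.normal(scale=2.0, size=100)

# One figure holding two plots side by side.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram with a title, axis labels, and a color.
ax_hist.hist(values, bins=30, color="steelblue")
ax_hist.set_title("Histogram of dummy values")
ax_hist.set_xlabel("value")
ax_hist.set_ylabel("count")

# Scatterplot with its own title, labels, color, and x-axis limits.
ax_scatter.scatter(x, y, color="darkorange")
ax_scatter.set_title("Scatterplot of y against x")
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_xlim(0, 10)

fig.tight_layout()
```

In an interactive session or a notebook, the figure would normally be displayed with plt.show() or written to disk with fig.savefig.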

Note

To get an overview, navigate to the official page of matplotlib at: http://matplotlib.org

IPython: IPython provides an environment for interactive computing.

It provides a browser-based notebook that serves as both an IDE and a development environment, with support for code, rich media, inline plots, and model summaries. These notebooks and their content can be saved and used later, either to demonstrate the results as they are or to extract the code and execute it separately. The notebook has emerged as a powerful tool for web-based tutorials, because the code and its results follow one another smoothly in this environment. In many places in this book, we will be using this environment.

Note

To get an overview, navigate to the official page of IPython here: http://ipython.org/

scikit-learn: scikit-learn is the mainstay of predictive modelling in Python. It is a robust collection of widely used data science algorithms, together with the methods needed to apply them. Some of the features of scikit-learn are as follows:

  • It is built on Python packages such as NumPy, SciPy, and matplotlib, and it works well alongside pandas

  • It is very simple and efficient to use

  • It has methods to implement most of the predictive modelling techniques, such as linear regression, logistic regression, clustering, and Decision Trees

  • It provides concise methods to predict outcomes from a fitted model and to measure the accuracy of those predictions
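
The last two points can be sketched with a linear regression on a dummy dataset: fit the model, predict on held-out data, and score the predictions. The data, the coefficients 4.0 and 3.0, the split size, and the choice of R-squared as the accuracy measure are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Dummy dataset: y depends linearly on a single feature, plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 4.0 * X[:, 0] + 3.0 + rng.normal(scale=1.0, size=200)

# Hold out 25% of the rows to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the model on the training rows only.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict outcomes for the held-out rows and measure the fit.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)

print(model.coef_, model.intercept_)  # should be near 4.0 and 3.0
print(r2)
```

Other estimators in scikit-learn follow the same fit and predict pattern, which is what keeps the code so short.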

Note

To get an overview, navigate to the official page of scikit-learn here: http://scikit-learn.org/stable/index.html

Any other Python packages used in this book will be introduced as the situation requires and can be installed using the method described earlier in this section.