Hands-On Data Science and Python Machine Learning

By : Frank Kane

Overview of this book

Join Frank Kane, who worked on Amazon and IMDb’s machine learning algorithms, as he guides you on your first steps into the world of data science. Hands-On Data Science and Python Machine Learning gives you the tools that you need to understand and explore the core topics in the field, and the confidence and practice to build and analyze your own machine learning models. With the help of interesting and easy-to-follow practical examples, Frank Kane explains potentially complex topics such as Bayesian methods and K-means clustering in a way that anybody can understand. Based on Frank’s successful data science course, Hands-On Data Science and Python Machine Learning empowers you to conduct data analysis and perform efficient machine learning using Python. Let Frank help you unearth the value in your data using the various data mining and data analysis techniques available in Python, and develop efficient predictive models to predict future results. You will also learn how to perform large-scale machine learning on Big Data using Apache Spark. The book covers preparing your data for analysis, training machine learning models, and visualizing the final data analysis.

Preface

Getting Started

Installing Enthought Canopy

Using and understanding IPython (Jupyter) Notebooks

Python basics - Part 1

Understanding Python code

Importing modules

Python basics - Part 2

Running Python scripts

Summary

Statistics and Probability Refresher, and Python Practice

Types of data

Mean, median, and mode

Using mean, median, and mode in Python

Standard deviation and variance

Probability density function and probability mass function

Types of data distributions

Percentiles and moments

Summary

Matplotlib and Advanced Probability Concepts

A crash course in Matplotlib

Covariance and correlation

Conditional probability

Bayes' theorem

Summary

Predictive Models

Linear regression

Polynomial regression

Multivariate regression and predicting car prices

Multi-level models

Summary

Machine Learning with Python

Machine learning and train/test

Using train/test to prevent overfitting of a polynomial regression

Bayesian methods - Concepts

Implementing a spam classifier with Naïve Bayes

K-Means clustering

Clustering people based on income and age

Measuring entropy

Decision trees - Concepts

Decision trees - Predicting hiring decisions using Python

Ensemble learning

Support vector machine overview

Using SVM to cluster people by using scikit-learn

Summary

Recommender Systems

What are recommender systems?

Item-based collaborative filtering

How item-based collaborative filtering works?

Finding movie similarities

Improving the results of movie similarities

Making movie recommendations to people

Improving the recommendation results

Summary

More Data Mining and Machine Learning Techniques

K-nearest neighbors - concepts

Using KNN to predict a rating for a movie

Dimensionality reduction and principal component analysis

A PCA example with the Iris dataset

Data warehousing overview

Reinforcement learning

Summary

Dealing with Real-World Data

Bias/variance trade-off

K-fold cross-validation to avoid overfitting

Data cleaning and normalisation

Cleaning web log data

Normalizing numerical data

Detecting outliers

Summary

Apache Spark - Machine Learning on Big Data

Installing Spark

Spark introduction

Spark and Resilient Distributed Datasets (RDD)

Introducing MLlib

Decision Trees in Spark with MLlib

K-Means Clustering in Spark

TF-IDF

Searching Wikipedia with Spark MLlib

Using the Spark 2.0 DataFrame API for MLlib

Summary

Testing and Experimental Design

A/B testing concepts

T-test and p-value

Measuring t-statistics and p-values using Python

Determining how long to run an experiment for

A/B test gotchas

Summary

Installing Enthought Canopy

Let's dive right in and get you set up with what you need to develop Python code for data science on your desktop. I'm going to walk you through installing a package called Enthought Canopy, which has both the development environment and all the Python packages you need pre-installed. It makes life really easy, but if you already know Python you might have an existing Python environment on your PC, and you may be able to keep using it, provided it meets the requirements below.

The most important thing is that your Python environment has Python 3.5 or newer, that it supports Jupyter Notebooks (because that's what we're going to use in this course), and that you have the key packages you need for this book installed on your environment. I'll explain exactly how to achieve a full installation in a few simple steps - it's going to be very easy.

Let's first go over those key packages, most of which Canopy will install for us automatically. Canopy will install Python 3.5 for us, along with some further packages we need, including scikit_learn, xlrd, and statsmodels. We'll need to use the pip command manually to install one more package, called pydotplus. And that will be it - it's very easy with Canopy!
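
If you would rather keep an existing Python environment, a quick check along these lines can tell you whether it has what this book needs. Treat it as a sketch only: the helper name check_environment is made up for illustration, and it assumes the packages are importable under the names sklearn, xlrd, statsmodels, and pydotplus (scikit-learn's import name differs from its package name).

```python
import importlib.util
import sys

# Import names assumed for the packages this book relies on.
# scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ("sklearn", "xlrd", "statsmodels", "pydotplus")


def check_environment(modules=REQUIRED_MODULES, min_version=(3, 5)):
    """Return (python_ok, missing) for the interpreter running this code."""
    python_ok = sys.version_info[:2] >= min_version
    # find_spec returns None when a top-level module cannot be found;
    # it does not import the module itself.
    missing = [name for name in modules
               if importlib.util.find_spec(name) is None]
    return python_ok, missing


python_ok, missing = check_environment()
print("Python version OK:", python_ok)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with pip, in the same way as pydotplus.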

Once the following installation steps are complete, we'll have everything we need to actually get up and running, and so we'll open up a little sample file and do some data science for real. Now let's get you set up with everything you need to get started as quickly as possible:

The first thing you will need is a development environment for Python code, called an IDE (integrated development environment). What we're going to use for this book is Enthought Canopy. It's a scientific computing environment, and it's going to work well with this book.

To get Canopy installed, just go to www.enthought.com and click on DOWNLOADS: Canopy.

Enthought Canopy is free for the Canopy Express edition, which is what you want for this book. You must then select your operating system and architecture. For me, that's Windows 64-bit, but you'll want to click on the corresponding Download button for your operating system, with the Python 3.5 option.

We don't have to give them any personal information at this step. There's a pretty standard Windows installer, so just let that download.

After that's downloaded, we go ahead and open up the Canopy installer, and run it! You might want to read the license before you agree to it; that's up to you. Then just wait for the installation to complete.
Once you hit the Finish button at the end of the install process, allow it to launch Canopy automatically. You'll see that Canopy then sets up the Python environment by itself, which is great, but this will take a minute or two.
Once the installer is done setting up your Python environment, you should get a screen that says Welcome to Canopy and shows a bunch of big friendly buttons.

The beautiful thing is that pretty much everything you need for this book comes pre-installed with Enthought Canopy, and that's why I recommend using it!
There is just one last thing we need to set up, so go ahead and click the Editor button on the Canopy Welcome screen. You'll then see the Editor screen come up. Click in the window at the bottom and type in:

!pip install pydotplus

Type the above line in at the bottom of the Canopy Editor window, and don't forget to press the Return key, of course.

Once you hit the Return key, this will install the one extra module that we need later on in the book, when we get to talking about decision trees and rendering decision trees.
Once it has finished installing pydotplus, it should come back and say it's successfully installed and, voila, you have everything you need to get started! The installation is done at this point, but let's take a few more steps to confirm our installation is running nicely.
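
The leading ! in that command tells the IPython prompt to hand the rest of the line to your operating system's shell. If you are working outside Canopy, much the same thing can be done from plain Python. The sketch below is one possible way to do it, not something Canopy requires: pip_command and install_if_missing are hypothetical helper names, and the code assumes the package name and its import name are the same, as they are for pydotplus.

```python
import importlib.util
import subprocess
import sys


def pip_command(package):
    """Build a pip install command for the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def install_if_missing(package, run=subprocess.check_call):
    """Install `package` with pip unless it can already be imported.

    Returns True if pip was invoked and False if nothing needed doing.
    Assumes the package name and the import name match.
    """
    if importlib.util.find_spec(package) is not None:
        return False
    run(pip_command(package))
    return True


# Example (uncomment to run; needs network access):
# install_if_missing("pydotplus")
```

Running pip through sys.executable makes sure the package lands in the same Python environment that is executing the code, which matters if you have more than one Python installed.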

Giving the installation a test run

Let's now give your installation a test run. The first thing to do is to close the Canopy window entirely! This is because we're not going to be editing and using our code within this Canopy editor. Instead we're going to be using something called an IPython Notebook, which is also now known as the Jupyter Notebook.
Let me show you how that works. Open a window in your operating system to view the accompanying book files that you downloaded, as described in the Preface of this book. You should see the set of .ipynb code files you downloaded for this book.
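
If you want to double-check from Python that the notebook files are where you expect them, a few lines like these will list them. This is just an illustrative sketch: list_notebooks is a made-up helper name, and "." stands in for wherever you unpacked the book files.

```python
from pathlib import Path


def list_notebooks(folder):
    """Return the names of the .ipynb files directly inside `folder`, sorted."""
    return sorted(p.name for p in Path(folder).glob("*.ipynb"))


# "." is a placeholder for the folder holding the book files.
for name in list_notebooks("."):
    print(name)
```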

Now go down to the Outliers file in the list, that's the Outliers.ipynb file, and double-click it. What should happen is that it starts up Canopy first, and then it kicks off your web browser! This is because IPython/Jupyter Notebooks actually live within your web browser. There can be a small pause at first, and it can be a little bit confusing the first time, but you'll soon get used to the idea.

You should soon see Canopy come up, followed by your default web browser, which for me is Chrome. You should see the Jupyter Notebook page for Outliers, since we double-clicked on the Outliers.ipynb file.

If you see this screen, it means that everything's working great in your installation and you're all set for the journey across the rest of this book!

If you occasionally get problems opening your IPYNB files

Just occasionally, I've noticed that things can go a little bit wrong when you double-click on a .ipynb file. Sometimes Canopy can get a little bit flaky, and you might see a screen that is looking for some password or token, or you might occasionally see a screen that says it can't connect at all.

Don't panic if either of those things happens to you. They are just random quirks: sometimes things don't start up in the right order, or they don't start up in time on your PC, and it's okay.

All you have to do is go back and try to open that file a second time. Sometimes it takes two or three tries to get it loaded up properly, but if you do it a couple of times it should pop up eventually, and you should see a Jupyter Notebook screen like the one we saw previously for the Outliers notebook.
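
If double-clicking keeps failing, one alternative is to start the notebook server yourself. This is not part of the Canopy steps above, so treat it as an assumption about your setup: the sketch assumes the notebook package is installed for the Python interpreter you run it with, and notebook_command is a hypothetical helper name.

```python
import subprocess
import sys


def notebook_command(path):
    """Build a command that opens `path` in the Jupyter Notebook server."""
    # Running the notebook module with a file path starts the server
    # and opens that notebook in your default web browser.
    return [sys.executable, "-m", "notebook", path]


# Example (uncomment to run; blocks until the server is stopped):
# subprocess.call(notebook_command("Outliers.ipynb"))
```

When the server is started this way, it prints its address to the console, which can help if the browser asks for a token.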
