Hands-On Data Science and Python Machine Learning

By: Frank Kane

Overview of this book

Join Frank Kane, who worked on Amazon and IMDb’s machine learning algorithms, as he guides you on your first steps into the world of data science. Hands-On Data Science and Python Machine Learning gives you the tools that you need to understand and explore the core topics in the field, and the confidence and practice to build and analyze your own machine learning models. With the help of interesting and easy-to-follow practical examples, Frank Kane explains potentially complex topics such as Bayesian methods and K-means clustering in a way that anybody can understand. Based on Frank’s successful data science course, Hands-On Data Science and Python Machine Learning empowers you to conduct data analysis and perform efficient machine learning using Python. Let Frank help you unearth the value in your data using the various data mining and data analysis techniques available in Python, and develop efficient predictive models to predict future results. You will also learn how to perform large-scale machine learning on Big Data using Apache Spark. The book covers preparing your data for analysis, training machine learning models, and visualizing the final data analysis.

Importing modules

Python itself, like any language, is fairly limited in what it can do on its own. The real power of using Python for machine learning, data mining, and data science comes from the external libraries that are available for those purposes. One of those libraries is called NumPy, or Numeric Python. For example, here we can import the NumPy package, which is included with Canopy, under the name np.

This means that I'll refer to the NumPy package as np, and I could call that anything I want. I could call it Fred or Tim, but it's best to stick with something that actually makes sense; now that I'm calling that NumPy package np, I can refer to it using np:

import numpy as np

In this example, I'll use the random module that's provided as part of the NumPy package and call its normal function to generate normally distributed random numbers using these parameters (a mean of 25.0, a standard deviation of 5.0, and 10 values), then print them out. Since they are random, I should get different results every time:

import numpy as np
A = np.random.normal(25.0, 5.0, 10)
print (A)

The output is an array of 10 values scattered around 25.0, and the exact numbers change on every run.

Sure enough, I get different results each time I run it. That's pretty cool.

Data structures

Let's move on to data structures. If you need to pause and let things sink in a little bit, or you want to play around with these a little bit more, feel free to do so. The best way to learn this stuff is to dive in and actually experiment, so I definitely encourage doing that, and that's why I'm giving you working IPython/Jupyter Notebooks, so you can actually go in, mess with the code, do different stuff with it.

For example, here we have a distribution around 25.0, but let's make it around 55.0:

import numpy as np
A = np.random.normal(55.0, 5.0, 10)
print (A)

Hey, all my numbers changed, they're closer to 55 now, how about that?

Alright, let's talk about data structures a little bit here. As we saw in our first example, you can have a list, and the syntax looks like this.

Experimenting with lists

x = [1, 2, 3, 4, 5, 6]
print (len(x))

You can create a list called x, for example, and assign it the numbers 1 through 6. The square brackets indicate that we are using a Python list, and lists are mutable objects that I can add things to and rearrange as much as I want to. There's a built-in function called len for determining the length of the list, and if I type in len(x), that will give me back the number 6 because there are 6 numbers in my list.

Just to make sure, and again to drive home the point that this is actually running real code here, let's add another number in there, such as 4545. If you run this, you'll get 7 because now there are 7 numbers in that list:

x = [1, 2, 3, 4, 5, 6, 4545]
print (len(x))

The output of the previous code example is as follows:

7

Go back to the original example there. Now you can also slice lists. If you want to take a subset of a list, there's a very simple syntax for doing so:

x[:3]

The output of the above code example is as follows:

[1, 2, 3]

Pre colon

If, for example, you want to take the first three elements of a list, everything before element number 3, we can say :3 to get the first three elements, 1, 2, and 3, and if you think about what's going on there, as far as indices go, like in most languages, we start counting from 0. So element 0 is 1, element 1 is 2, and element 2 is 3. Since we're saying we want everything before element 3, that's what we're getting.

So, you know, never forget that in most languages, you start counting at 0 and not 1.

Now, zero-based counting can confuse matters, but in this case it does make intuitive sense: you can read x[:3] as "I want the first three elements." I could change that to four, just to make the point again that we're actually doing something real here:

x[:4]

The output of the above code example is as follows:

[1, 2, 3, 4]

Post colon

Now if I put the colon on the other side of the 3, that says I want element 3 and everything after it. If I say x[3:], that starts at element number 3 (counting 0, 1, 2, 3, so the fourth item in the list) and takes everything after it. So that's going to return 4, 5, and 6 in that example, OK?

x[3:]

The output is as follows:

[4, 5, 6]

You might want to keep this IPython/Jupyter Notebook file around. It's a good reference, because sometimes it can get confusing as to whether a slice includes the element at the index you give or stops just before it. So the best way is to just play around with it here and remind yourself.
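As a quick reference for the two slice forms above, here is a minimal sketch. The variable names first_three and from_three are just illustrative.

```python
x = [1, 2, 3, 4, 5, 6]

# Everything before index 3 (indices 0, 1, 2); index 3 itself is excluded
first_three = x[:3]

# Index 3 and everything after it; index 3 itself is included
from_three = x[3:]

print(first_three)   # [1, 2, 3]
print(from_three)    # [4, 5, 6]

# The two pieces always reassemble into the original list
print(first_three + from_three == x)   # True
```

A handy way to remember it: the number before the colon is included, and the number after the colon is not.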

Negative syntax

One more thing you can do is have this negative syntax:

x[-2:]

The output is as follows:

[5, 6]

By saying x[-2:], I'm asking for the last two elements in the list. It means go backwards two from the end and take everything from there on, and that will give me 5 and 6, because those are the last two things in my list.
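Negative numbers work on either side of the colon, and as a plain index too. The following is a small sketch of those variations, with illustrative variable names:

```python
x = [1, 2, 3, 4, 5, 6]

last = x[-1]           # the last element
last_two = x[-2:]      # the last two elements
all_but_two = x[:-2]   # everything except the last two

print(last)          # 6
print(last_two)      # [5, 6]
print(all_but_two)   # [1, 2, 3, 4]
```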

Adding list to list

You can also change lists around. Let's say I want to add a list to the list. I can use the extend function for that, as shown in the following code block:

x.extend([7,8])
x

The output of the above code is as follows:

[1, 2, 3, 4, 5, 6, 7, 8]

I have my list of 1, 2, 3, 4, 5, 6. If I want to extend it, I can pass in a new list, [7, 8], where the brackets indicate that this is a list in its own right. It could be a list written inline like that, or it could be referred to by another variable. You can see that once I do that, my list has 7 and 8 appended onto the end of it. Note that extend modifies x in place rather than creating a new list.

The append function

If you want to just add one more thing to that list, you can use the append function. So I just want to stick the number 9 at the end, there we go:

x.append(9)
x

The output of the above code is as follows:

[1, 2, 3, 4, 5, 6, 7, 8, 9]
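The difference between extend and append matters most when the thing you are adding is itself a list. Here is a short sketch using two throwaway lists, a and b, which are not part of the running example:

```python
a = [1, 2, 3]
a.extend([4, 5])    # adds each item of the argument to a
print(a)            # [1, 2, 3, 4, 5]

b = [1, 2, 3]
b.append([4, 5])    # adds the argument itself as a single element
print(b)            # [1, 2, 3, [4, 5]]
print(len(b))       # 4
```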

Complex data structures

You can also have complex data structures with lists. So you don't have to just put numbers in it; you can actually put strings in it. You can put numbers in it. You can put other lists in it. It doesn't matter. Python is a dynamically typed language, so you can pretty much put whatever kind of data you want, wherever you want, and it will generally be an OK thing to do:

y = [10, 11, 12]
listOfLists = [x, y]
listOfLists

In the preceding example, I have a second list that contains 10, 11, 12, that I'm calling y. I'll create a new list that contains two lists. How's that for mind blowing? Our listOfLists list will contain the x list and the y list, and that's a perfectly valid thing to do. You can see here that we have a bracket indicating the listOfLists list, and within that, we have another set of brackets indicating each individual list that is in that list:

[[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12]]

So, sometimes things like these will come in handy.
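To reach into a nested list, you chain two indexes. The sketch below uses a shorter x than the running example to keep the output small. It also shows that the outer list holds references to x and y rather than copies:

```python
x = [1, 2, 3]
y = [10, 11, 12]
listOfLists = [x, y]

# The first index picks the inner list, the second picks the element
print(listOfLists[1][0])   # 10

# The outer list stores references to x and y, not copies,
# so a later change to y is visible through listOfLists
y.append(13)
print(listOfLists)         # [[1, 2, 3], [10, 11, 12, 13]]
```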

Dereferencing a single element

If you want to dereference a single element of the list, you can just put its index in square brackets, like this:

y[1]

The output of the above code is as follows:

11

So y[1] will return element 1. Remember from the previous example that y had 10, 11, 12 in it, and that we start counting from 0, so element 1 will actually be the second element in the list, or the number 11 in this case, alright?

The sort function

Finally, lists have a built-in sort function that you can use:

z = [3, 2, 1]
z.sort()
z

So if I start with list z, which is 3, 2, and 1, I can call sort on that list, and z will now be sorted in order. The output of the above code is as follows:

[1, 2, 3]

Reverse sort

z.sort(reverse=True)
z

The output of the above code is as follows:

[3, 2, 1]

If you need to do a reverse sort, you can just pass reverse=True as a parameter to that sort function, and that will put it back to 3, 2, 1.
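One thing worth knowing is that sort changes the list in place and returns None. If you want to keep the original order around, Python's built-in sorted function returns a new list instead. A minimal sketch, with illustrative variable names:

```python
z = [3, 1, 2]

ordered = sorted(z)    # returns a new sorted list
print(ordered)         # [1, 2, 3]
print(z)               # [3, 1, 2] (z itself is unchanged)

result = z.sort()      # sorts z in place
print(result)          # None
print(z)               # [1, 2, 3]

print(sorted(z, reverse=True))   # [3, 2, 1]
```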

If you need to let that sink in a little bit, feel free to go back and read it a little bit more.

Tuples

Tuples are just like lists, except they're immutable, so you can't actually extend, append, or sort them. They are what they are, and they behave just like lists, apart from the fact that you can't change them. You indicate that something is a tuple, as opposed to a list, by using parentheses instead of square brackets. So you can see they work pretty much the same way otherwise:

#Tuples are just immutable lists. Use () instead of []
x = (1, 2, 3)
len(x)

The output of the previous code is as follows:

3

We can say x = (1, 2, 3). I can still use len on that to see that there are three elements in that tuple. If you're not familiar with the term, a tuple can actually contain as many elements as you want. Even though the name sounds like it derives from the Latin for the number three, it doesn't mean you have three things in it. Often it only has two things in it, but it can have as many as you want, really.
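To see the immutability for yourself, you can try to change a tuple and catch the error. This is only a sketch: the exact wording of the error message varies between Python versions, so it is printed rather than assumed.

```python
x = (1, 2, 3)

try:
    x[0] = 99            # item assignment is not allowed on a tuple
except TypeError as err:
    print("TypeError:", err)

print(hasattr(x, "append"))   # False (tuples have no append method)
print(x)                      # (1, 2, 3)

pair = (4, 5)
single = (4,)            # a one-element tuple needs the trailing comma
print(len(pair), len(single))   # 2 1
```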

Dereferencing an element

We can also dereference the elements of a tuple, so element number 2 again would be the third element, because we start counting from 0, and that will give me back the number 6 in the following example:

y = (4, 5, 6)
y[2]

The output of the above code is as follows:

6

List of tuples

We can also, like we could with lists, use tuples as elements of a list.

listOfTuples = [x, y]
listOfTuples

The output to the above code is as follows:

[(1, 2, 3), (4, 5, 6)]

We can create a new list that contains two tuples. So in the preceding example, we have our x tuple of (1, 2, 3) and our y tuple of (4, 5, 6); then we make a list of those two tuples and we get back this structure, where we have square brackets indicating a list that contains two tuples indicated by parentheses.

One thing that tuples are commonly used for when we're doing data science, or any sort of managing or processing of data really, is to assign input data to variables as it's read in. I want to walk you through what's going on in the following example:

(age, income) = "32,120000".split(',')
print (age)
print (income)

The output to the above code is as follows:

32
120000

Let's say we have a line of input data coming in from a comma-separated value file, which contains an age, say 32, followed by a comma and an income, say 120000, just to make something up. What I can do is, as each line comes in, call the split function on it to separate it into a pair of values that were delimited by the comma. Strictly speaking, split returns a list rather than a tuple, but the unpacking works the same way: I can assign the result to two variables, age and income, all at once by writing a tuple of (age, income) on the left-hand side and setting it equal to whatever comes out of the split function.

So this is basically a common shorthand you'll see for assigning multiple fields to multiple variables at once. If I run that, you can see that the age variable ends up assigned to 32 and income to 120000 because of that little trick there. Note that both values are still strings at this point, because split doesn't convert them to numbers. You do need to be careful when you're doing this sort of thing, because if the number of fields in the line doesn't match the number of variables on the left, you will get a ValueError exception.
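Here is one way you might guard against malformed lines and convert the fields to numbers at the same time. This is a sketch only; parse_line is a hypothetical helper name, not something from the book's notebooks.

```python
def parse_line(line):
    """Split an 'age,income' line and return both values as integers.

    Returns None when the line does not have exactly two fields.
    """
    fields = line.split(',')
    if len(fields) != 2:
        return None
    age, income = fields           # unpack the two strings
    return int(age), int(income)   # convert them to numbers

print(parse_line("32,120000"))      # (32, 120000)
print(parse_line("32,120000,NY"))   # None

# Without the length check, a field-count mismatch raises ValueError
try:
    (age, income) = "32,120000,NY".split(',')
except ValueError as err:
    print("ValueError:", err)
```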

Dictionaries

Finally, the last data structure that we'll see a lot in Python is a dictionary, and you can think of that as a map or a hash table in other languages. It's a way to basically have a sort of mini-database, sort of a key/value data store that's built into Python. So let's say I want to build up a little dictionary of Star Trek ships and their captains.

I can set up captains = {}, where the curly brackets indicate an empty dictionary. Now I can use square-bracket syntax to assign entries in my dictionary, so I can say the captain for Enterprise is Kirk, for Enterprise D it is Picard, for Deep Space Nine it is Sisko, and for Voyager it is Janeway. Now I have, basically, this lookup table that will associate ship names with their captains, and I can say, for example, print(captains["Voyager"]), and I get back Janeway.
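The listing for this example is not reproduced in the text, so what follows is a reconstruction from the description rather than the original code. The key and value strings are taken from the prose.

```python
# Start with an empty dictionary
captains = {}

# Assign entries with square-bracket syntax: captains[key] = value
captains["Enterprise"] = "Kirk"
captains["Enterprise D"] = "Picard"
captains["Deep Space Nine"] = "Sisko"
captains["Voyager"] = "Janeway"

# Look up a value by its key
print(captains["Voyager"])   # Janeway
```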

This is a very useful tool for doing lookups of some sort. Let's say you have some sort of an identifier in a dataset that maps to some human-readable name. You'll probably be using a dictionary to actually do that lookup when you're printing it out.

We can also see what happens if you try to look up something that doesn't exist. We can use the get function on a dictionary to safely return an entry. Enterprise does have an entry in my dictionary, so get just gives me back Kirk. But if I ask for the NX-01 ship, I never defined the captain of that, so it comes back with a None value in this example. That is better than throwing an exception, but you do need to be aware that this is a possibility:

print (captains.get("NX-01"))

The output of the above code is as follows:

None

The captain is Jonathan Archer, but you know, I'm getting a little bit too geeky here now.
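For comparison, here is a short sketch of the different ways a missing key can be handled. It uses a trimmed-down two-entry dictionary, and the "unknown" fallback string is just an illustrative choice.

```python
captains = {"Enterprise": "Kirk", "Voyager": "Janeway"}

# get returns None for a missing key...
print(captains.get("NX-01"))              # None

# ...or a fallback value of your choosing, passed as the second argument
print(captains.get("NX-01", "unknown"))   # unknown

# The in operator tests whether a key exists
print("NX-01" in captains)                # False

# Square-bracket lookup raises KeyError for a missing key
try:
    captains["NX-01"]
except KeyError as err:
    print("KeyError:", err)
```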

Iterating through entries

for ship in captains:
    print (ship + ": " + captains[ship])

Let's look at this little example of iterating through the entries in a dictionary. If I want to iterate through every ship that I have in my dictionary and print out the captains, I can type for ship in captains, and this will iterate through every single key in my dictionary. Then I can print out the lookup value of each ship's captain, which gives one line of output per ship, in the form Enterprise: Kirk.
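If you want the key and the value together, the dictionary's items method hands you both on each pass through the loop, which saves the extra lookup. A minimal sketch with a two-entry dictionary follows. In Python 3.7 and later, entries come back in insertion order, while older versions make no such guarantee.

```python
captains = {"Enterprise": "Kirk", "Voyager": "Janeway"}

# items() yields (key, value) pairs, unpacked here into two variables
for ship, captain in captains.items():
    print(ship + ": " + captain)

# The same pairs collected into a list of formatted strings
lines = [ship + ": " + captain for ship, captain in captains.items()]
```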

There you have it. These are basically the main data structures that you'll encounter in Python. There are some others, such as sets, but we'll not really use them in this book, so I think that's enough to get you started. Let's dive into some more Python nuances in our next section.
