Book Image

Hands-On Data Science and Python Machine Learning

By : Frank Kane
Book Image

Hands-On Data Science and Python Machine Learning

By: Frank Kane

Overview of this book

Join Frank Kane, who worked on Amazon and IMDb’s machine learning algorithms, as he guides you on your first steps into the world of data science. Hands-On Data Science and Python Machine Learning gives you the tools that you need to understand and explore the core topics in the field, and the confidence and practice to build and analyze your own machine learning models. With the help of interesting and easy-to-follow practical examples, Frank Kane explains potentially complex topics such as Bayesian methods and K-means clustering in a way that anybody can understand them. Based on Frank’s successful data science course, Hands-On Data Science and Python Machine Learning empowers you to conduct data analysis and perform efficient machine learning using Python. Let Frank help you unearth the value in your data using the various data mining and data analysis techniques available in Python, and to develop efficient predictive models to predict future results. You will also learn how to perform large-scale machine learning on Big Data using Apache Spark. The book covers preparing your data for analysis, training machine learning models, and visualizing the final data analysis.
Table of Contents (11 chapters)

Using and understanding IPython (Jupyter) Notebooks

Congratulations on your installation! Let's now explore using Jupyter Notebooks, which is also known as IPython Notebook. These days, the more modern name is the Jupyter Notebook, but a lot of people still call it an IPython Notebook, and I consider the names interchangeable for working developers as a result. I do also find the name IPython Notebooks helps me remember the notebook file name suffix which is .ipynb as you'll get to know very well in this book!

Okay so now let's take it right from the top again - with our first exploration of the IPython/Jupyter Notebook. If you haven't yet done so, please navigate to the DataScience folder where we have downloaded all the materials for this book. For me, that's E:DataScience, and if you didn't do so during the preceding installation section, please now double-click and open up the Outliers.ipynb file.

Now what's going to happen when we double-click on this IPython .ipynb file is that first of all it's going to spark up Canopy, if it's not sparked up already, and then it's going to launch a web browser. This is how the full Outliers notebook webpage looks within my browser:

As you can see here, notebooks are structured in such a way that I can intersperse my little notes and commentary about what you're seeing here within the actual code itself, and you can actually run this code within your web browser! So, it's a very handy format for me to give you sort of a little reference that you can use later on in life to go and remind yourself how these algorithms work that we're going to talk about, and actually experiment with them and play with them yourself.

The way that the IPython/Jupyter Notebook files work is that they actually run from within your browser, like a webpage, but they're backed by the Python engine that you installed. So you should be seeing a screen similar to the one shown in the previous screenshot.

You'll notice as you scroll down the notebook in your browser, there are code blocks. They're easy to spot because they contain our actual code. Please find the code box for this code in the Outliers notebook, quite near the top:

%matplotlib inline 
import numpy as np 
 
incomes = np.random.normal(27000, 15000, 10000) 
incomes = np.append(incomes, [1000000000]) 
 
import matplotlib.pyplot as plt 
plt.hist(incomes, 50) 
plt.show() 

Let's take a quick look at this code while we're here. We are setting up a little income distribution in this code. We're simulating the distribution of income in a population of people, and to illustrate the effect that an outlier can have on that distribution, we're simulating Donald Trump entering the mix and messing up the mean value of the income distribution. By the way, I'm not making a political statement, this was all done before Trump became a politician. So you know, full disclosure there.

We can select any code block in the notebook by clicking on it. So if you now click in the code block that contains the code we just looked at above, we can then hit the run button at the top to run it. Here's the area at the top of the screen where you'll find the Run button:

Hitting the Run button with the code block selected, will cause this graph to be regenerated:

Similarly, we can click on the next code block a little further down, you'll spot the one which has the following single line of code :

incomes.mean() 

If you select the code block containing this line, and hit the Run button to run the code, you'll see the output below it, which ends up being a very large value because of the effect of that outlier, something like this:

127148.50796177129

Let's keep going and have some fun. In the next code block down, you'll see the following code, which tries to detect outliers like Donald Trump and remove them from the dataset:

def reject_outliers(data): 
    u = np.median(data) 
    s = np.std(data) 
    filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)] 
    return filtered 
 
filtered = reject_outliers(incomes) 
plt.hist(filtered, 50) 
plt.show() 

So select the corresponding code block in the notebook, and press the run button again. When you do that, you'll see this graph instead:

Now we see a much better histogram that represents the more typical American - now that we've taken out our outlier that was messing things up.

So, at this point, you have everything you need to get started in this course. We have all the data you need, all the scripts, and the development environment for Python and Python notebooks. So, let's rock and roll. Up next we're going to do a little crash course on Python itself, and even if you're familiar with Python, it might be a good little refresher so you might want to watch it regardless. Let's dive in and learn Python.