Python Machine Learning Blueprints - Second Edition

By : Alexander Combs, Michael Roman

Overview of this book

Machine learning is transforming the way we understand and interact with the world around us. This book is the perfect guide for you to put your knowledge and skills into practice and use the Python ecosystem to cover key domains in machine learning. This second edition covers a range of libraries from the Python ecosystem, including TensorFlow and Keras, to help you implement real-world machine learning projects. The book begins by giving you an overview of machine learning with Python. With the help of complex datasets and optimized techniques, you’ll go on to understand how to apply advanced concepts and popular machine learning algorithms to real-world projects. Next, you’ll cover projects from domains such as predictive analytics to analyze the stock market and recommendation systems for GitHub repositories. In addition to this, you’ll also work on projects from the NLP domain to create a custom news feed using frameworks such as scikit-learn, TensorFlow, and Keras. Following this, you’ll learn how to build an advanced chatbot, and scale things up using PySpark. In the concluding chapters, you can look forward to exciting insights into deep learning and you'll even create an application using computer vision and neural networks. By the end of this book, you’ll be able to analyze data seamlessly and make a powerful impact through your projects.

Python libraries and functions for each stage of the data science workflow

Now that you have an understanding of each step in the data science workflow, we'll take a look at a selection of useful Python libraries and functions within those libraries for each step.


Since one of the more common ways to access data is through a RESTful API, one library that you'll want to be aware of is the Python Requests library. Dubbed "HTTP for humans", it makes interacting with APIs a clean and simple experience.

Let's take a look at a sample interaction, using requests to pull down data from GitHub's API. Here, we will make a call to the API and request a list of starred repositories for a user:

import requests
r = requests.get(r"")
r.json()

This will return a JSON of all the repositories the user has starred, along with attributes about each. Here is a snippet of the output for the preceding call:

Output snippet when we return a JSON of all the repositories

The requests library has an amazing number of features—far too many to cover here, but I do suggest you check out the documentation.


Because inspecting your data is such a critical step in the development of machine learning applications, we'll now take an in-depth look at several libraries that will serve you well in this task.

The Jupyter Notebook

There are a number of libraries that will make the data inspection process easier. The first is Jupyter Notebook with IPython. This is a fully-fledged, interactive computing environment, and it is ideal for data exploration. Unlike most development environments, Jupyter Notebook is a web-based frontend (to the IPython kernel) that is divided into individual code blocks or cells. Cells can be run individually or all at once, depending on the need. This allows the developer to run a scenario, see the output, then step back through the code, make adjustments, and see the resulting changes, all without leaving the notebook. Here is a sample interaction in the Jupyter Notebook:

Sample interaction in the Jupyter Notebook

You will notice that we have done a number of things here and have interacted with not only the IPython backend, but the terminal shell as well. Here, I have imported the Python os library and made a call to find the current working directory (cell #2), which you can see is the output below my input code cell. I then changed directories using the os library in cell #3, but stopped utilizing the os library and began using Linux-based commands in cell #4. This is done by prefixing the command with !. In cell #6, you can see that I was even able to save the shell output to a Python variable (file_two). This is a great feature that makes file operations a simple task.

Note that the results would obviously differ slightly on your machine, since this displays information on the user under which it runs.

Now, let's take a look at some simple data operations using the notebook. This will also be our first introduction to another indispensable library, pandas.


Pandas is a remarkable tool for data analysis that aims to be the most powerful and flexible open source data analysis/manipulation tool available in any language. And, as you will soon see, if it doesn't already live up to this claim, it can't be too far off. Let's now take a look:

Importing the iris dataset

You can see from the preceding screenshot that I have imported a classic machine learning dataset, the iris dataset, using scikit-learn, a library we'll examine in detail later. I then passed the data into a pandas DataFrame, making sure to assign the column headers. One DataFrame contains flower measurement data, and the other DataFrame contains a number that represents the iris species. This is coded 0, 1, and 2 for setosa, versicolor, and virginica respectively. I then concatenated the two DataFrames.
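As a minimal sketch of the steps just described, assuming scikit-learn's bundled load_iris() loader, the import might look like the following. The variable names (features, species, df) are illustrative choices and not necessarily those used in the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled iris data (no network access is needed).
iris = load_iris()

# One DataFrame for the four measurements, with named columns...
features = pd.DataFrame(iris.data, columns=iris.feature_names)
# ...and one for the species code (0, 1, or 2).
species = pd.DataFrame(iris.target, columns=['species'])

# Concatenate side by side (axis=1) into a single DataFrame.
df = pd.concat([features, species], axis=1)
print(df.shape)
```

The result has 150 rows and five columns: four measurements plus the species code.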

For working with datasets that will fit on a single machine, pandas is the ultimate tool; you can think of it a bit like Excel on steroids. And, like the popular spreadsheet program, the basic units of operation are columns and rows of data that form tables. In the terminology of pandas, columns of data are series and the table is a DataFrame.

Using the same iris DataFrame we loaded previously, let's now take a look at a few common operations, including the following:

The first action was just to use the .head() command to get the first five rows. The second command was to select a single column from the DataFrame by referencing it by its column name. Another way we perform this data slicing is to use the .iloc[row,column] or .loc[row,column] notation. The former slices data using integer positions for the rows and columns (positional indexing), while the latter selects by label (label-based indexing); because our rows carry the default integer labels, .loc here takes row numbers together with named columns.

Let's select the first two columns and the first four rows using the .iloc notation. We'll then look at the .loc notation:

Using the .iloc notation and the Python list slicing syntax, we were able to select a slice of this DataFrame.
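The following is a small sketch of the operations discussed so far, assuming the iris DataFrame is rebuilt from scikit-learn's load_iris(). The variable names are illustrative. Note that a .loc slice includes its end label, while an .iloc slice excludes its end position.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

first_rows = df.head()                # first five rows
one_column = df['sepal length (cm)']  # a single column, returned as a Series

# Positional: first four rows, first two columns (end positions excluded).
by_position = df.iloc[:4, :2]

# Label-based: row labels 0 through 3 (end label included), two named columns.
by_label = df.loc[:3, ['sepal length (cm)', 'sepal width (cm)']]
```

Both slices select the same four rows and two columns here.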

Now, let's try something more advanced. We'll use a list comprehension to select just the width feature columns:

What we have done here is create a list that is a subset of all columns. df.columns returns the column names, and our comprehension uses a conditional statement to select only those with width in the title. Obviously, in this situation, we could have just as easily typed out the columns we wanted into a list, but this gives you a sense of the power available when dealing with much larger datasets.
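A sketch of that column selection, again assuming the DataFrame is rebuilt from load_iris() (the names width_cols and width_df are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Keep only the column names that contain the word 'width'.
width_cols = [col for col in df.columns if 'width' in col]

# Passing a list of names selects those columns as a new DataFrame.
width_df = df[width_cols]
```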

We've seen how to select slices based on their position within the DataFrame, but let's now look at another method to select data. This time, we will select a subset of the data based upon satisfying conditions that we specify:

  1. Let's now see the unique list of species available, and select just one of those:
  2. In the far-right column, you will notice that our DataFrame only contains data for the Iris-virginica species (represented by the 2) now. In fact, the size of the DataFrame is now 50 rows, down from the original 150 rows:
  3. You can also see that the index on the left retains the original row numbers. If we wanted to save just this data, we could save it as a new DataFrame, and reset the index as shown in the following diagram:
  4. We have selected data by placing a condition on one column; let's now add more conditions. We'll go back to our original DataFrame and add two conditions:

The DataFrame now only includes data from the virginica species with a petal width greater than 2.2.
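The four steps above can be sketched as follows, assuming the iris DataFrame is rebuilt from load_iris(). The variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

species_codes = df['species'].unique()  # the distinct species codes

# A boolean mask keeps only the rows where the condition is True.
virginica = df[df['species'] == 2]      # original row labels are retained

# Save as a new DataFrame whose index restarts at 0.
virginica_reset = virginica.reset_index(drop=True)

# Two conditions: wrap each in parentheses and combine them with & (and).
wide_virginica = df[(df['species'] == 2) & (df['petal width (cm)'] > 2.2)]
```

The parentheses matter because & binds more tightly than the comparison operators in Python.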

Let's now move on to using pandas to get some quick descriptive statistics from our iris dataset:

With a call to the .describe() function, I have received a breakdown of the descriptive statistics for each of the relevant columns. (Notice that species was automatically removed as it is not relevant for this.) I could also pass in my own percentiles if I wanted more granular information:
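As a sketch of passing custom percentiles, assuming the feature columns are rebuilt from load_iris() (the particular percentile values below are arbitrary examples):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# Default summary: count, mean, std, min, quartiles, and max.
default_summary = features.describe()

# Custom percentiles are given as fractions between 0 and 1;
# pandas always includes the median (50%) as well.
summary = features.describe(percentiles=[.20, .40, .80, .90, .95])
```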

Next, let's check whether there is any correlation between these features. That can be done by calling .corr() on our DataFrame:

The default returns the Pearson correlation coefficient for each row-column pair. This can be switched to Kendall's Tau or Spearman's rank correlation coefficient by passing in a method argument (for example, .corr(method="spearman") or .corr(method="kendall")).
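A short sketch of the three correlation methods, assuming the feature columns are rebuilt from load_iris(). In the pandas versions I am aware of, the Kendall method relies on SciPy being installed.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

pearson = features.corr()                     # default: Pearson
spearman = features.corr(method='spearman')   # rank-based
kendall = features.corr(method='kendall')     # Kendall's Tau

# Each result is a square table indexed by feature name on both axes.
petal_corr = pearson.loc['petal length (cm)', 'petal width (cm)']
```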


So far, we have seen how to select portions of a DataFrame and how to get summary statistics from our data, but let's now move on to learning how to visually inspect the data. But first, why even bother with visual inspection? Let's see an example to understand why.

Here are the summary statistics for four distinct series of x and y values:

Series of x and y

Mean of x
Mean of y
Sample variance of x
Sample variance of y
Correlation between x and y
Regression line: y = 3.00 + 0.500x

Based on the series having identical summary statistics, you might assume that these series would appear visually similar. You would, of course, be wrong. Very wrong. The four series are part of Anscombe's quartet, and they were deliberately created to illustrate the importance of visual data inspection. Each series is plotted as follows:

Clearly, we would not treat these datasets as identical after having visualized them. So, now that we understand the importance of visualization, let's take a look at a pair of useful Python libraries for this.
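To check the claim that the four series share their summary statistics, here is a small sketch using NumPy. The data values are the commonly published figures for Anscombe's quartet, typed in by hand here as an assumption; they are not listed in the text above.

```python
import numpy as np

# Commonly published Anscombe values (an assumption, entered by hand).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I': (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II': (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV': ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

stats = {}
for name, (xs, ys) in quartet.items():
    x = np.asarray(xs, dtype=float)
    y = np.asarray(ys, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
    stats[name] = {
        'mean_x': x.mean(),
        'mean_y': y.mean(),
        'var_x': x.var(ddof=1),             # ddof=1 gives the sample variance
        'var_y': y.var(ddof=1),
        'corr': np.corrcoef(x, y)[0, 1],
        'slope': slope,
        'intercept': intercept,
    }

for name, s in stats.items():
    print(name, {k: round(float(v), 3) for k, v in s.items()})
```

The printed rows agree to about two decimal places, even though the plotted series look nothing alike.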

The matplotlib library

The first library we'll take a look at is matplotlib. The matplotlib library is the center of the Python plotting library universe. Originally created to emulate the plotting functionality of MATLAB, it grew into a fully-featured library in its own right with an enormous range of functionality. If you have not come from a MATLAB background, it can be hard to understand how all the pieces work together to create the graphs you see. I'll do my best to break down the pieces into logical components so you can get up to speed quickly. But before diving into matplotlib in full, let's set up our Jupyter Notebook to allow us to see our graphs inline. To do this, add the following lines to your import statements:

import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

The first line imports matplotlib, the second line sets the styling to approximate R's ggplot library (requires matplotlib 1.4.1 or greater), and the last line sets the plots so that they are visible within the notebook.

Now, let's generate our first graph using our iris dataset:

fig, ax = plt.subplots(figsize=(6,4)) 
ax.hist(df['petal width (cm)'], color='black'); 
ax.set_ylabel('Count', fontsize=12) 
ax.set_xlabel('Width', fontsize=12) 
plt.title('Iris Petal Width', fontsize=14, y=1.01) 

The preceding code generates the following output:

There is a lot going on even in this simple example, but we'll break it down line by line. The first line creates a single subplot with a width of 6 inches and a height of 4 inches. We then plot a histogram of the petal width from our iris DataFrame by calling .hist() and passing in our data. We also set the bar color to black here. The next two lines place labels on our y and x axes, respectively, and the final line sets the title for our graph. We tweak the title's y position relative to the top of the graph with the y parameter, and increase the font size slightly over the default. This gives us a nice histogram of our petal width data. Let's now expand on that, and generate histograms for each column of our iris dataset:

fig, ax = plt.subplots(2,2, figsize=(6,4)) 
ax[0][0].hist(df['petal width (cm)'], color='black'); 
ax[0][0].set_ylabel('Count', fontsize=12) 
ax[0][0].set_xlabel('Width', fontsize=12) 
ax[0][0].set_title('Iris Petal Width', fontsize=14, y=1.01) 
ax[0][1].hist(df['petal length (cm)'], color='black'); 
ax[0][1].set_ylabel('Count', fontsize=12) 
ax[0][1].set_xlabel('Length', fontsize=12) 
ax[0][1].set_title('Iris Petal Length', fontsize=14, y=1.01) 
ax[1][0].hist(df['sepal width (cm)'], color='black'); 
ax[1][0].set_ylabel('Count', fontsize=12) 
ax[1][0].set_xlabel('Width', fontsize=12) 
ax[1][0].set_title('Iris Sepal Width', fontsize=14, y=1.01) 
ax[1][1].hist(df['sepal length (cm)'], color='black'); 
ax[1][1].set_ylabel('Count', fontsize=12) 
ax[1][1].set_xlabel('Length', fontsize=12) 
ax[1][1].set_title('Iris Sepal Length', fontsize=14, y=1.01)
plt.tight_layout()

The output for the preceding code is shown in the following diagram:

Obviously, this is not the most efficient way to code this, but it is useful for demonstrating how matplotlib works. Notice that instead of the single subplot object, ax, as we had in the first example, we now have four subplots, which are accessed through what is now the ax array. A new addition to the code is the call to plt.tight_layout(); this function will nicely auto-space your subplots to avoid crowding.
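A more compact way to draw the same four histograms is to loop over the axes. This is a sketch under a few assumptions: the DataFrame is rebuilt from load_iris(), the full column name is used as the x label, and the Agg backend line is only there so the code runs outside a notebook (drop it when using %matplotlib inline).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

fig, ax = plt.subplots(2, 2, figsize=(6, 4))

# ax.flat walks the 2x2 array of axes in row-major order,
# so each subplot is paired with one feature column.
for axis, column in zip(ax.flat, df.columns):
    axis.hist(df[column], color='black')
    axis.set_ylabel('Count', fontsize=12)
    axis.set_xlabel(column, fontsize=12)

plt.tight_layout()  # auto-space the subplots to avoid crowding
```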

Let's now take a look at a few other types of plots available in matplotlib. One useful plot is a scatterplot. Here, we will plot the petal width against the petal length:

fig, ax = plt.subplots(figsize=(6,6)) 
ax.scatter(df['petal width (cm)'], df['petal length (cm)'], color='green')
ax.set_xlabel('Petal Width') 
ax.set_ylabel('Petal Length') 
ax.set_title('Petal Scatterplot') 

The preceding code generates the following output:

As before, we could add in multiple subplots to examine each facet.

Another plot we could examine is a simple line plot. Here, we will look at a plot of the petal length:

fig, ax = plt.subplots(figsize=(6,6)) 
ax.plot(df['petal length (cm)'], color='blue') 
ax.set_xlabel('Specimen Number') 
ax.set_ylabel('Petal Length') 
ax.set_title('Petal Length Plot') 

The preceding code generates the following output:

We can already begin to see, based on this simple line plot, that there are distinctive clusters of lengths for each species—remember our sample dataset had 50 ordered examples of each type. This tells us that petal length is likely to be a useful feature to discriminate between the species if we were to build a classifier.

Let's look at one final type of chart from the matplotlib library, the bar chart. This is perhaps one of the more common charts you'll see. Here, we'll plot a bar chart for the mean of each feature for the three species of irises, and to make it more interesting, we'll make it a stacked bar chart with a number of additional matplotlib features:

import numpy as np
fig, ax = plt.subplots(figsize=(6,6))
bar_width = .8
labels = [x for x in df.columns if 'length' in x or 'width' in x]
set_y = [df[df['species']==0][x].mean() for x in labels]
ver_y = [df[df['species']==1][x].mean() for x in labels]
vir_y = [df[df['species']==2][x].mean() for x in labels]
x = np.arange(len(labels))
ax.bar(x, set_y, bar_width, color='black')
ax.bar(x, ver_y, bar_width, bottom=set_y, color='darkgrey')
ax.bar(x, vir_y, bar_width, bottom=[i+j for i,j in zip(set_y, ver_y)], color='white')
ax.set_xticks(x + (bar_width/2))
ax.set_xticklabels(labels, rotation=-70, fontsize=12);
ax.set_title('Mean Feature Measurement By Species', y=1.01)

The output for the preceding snippet is given here:

To generate the bar chart, we need to pass the x and y values into the .bar() function. In this case, the x values will just be an array of the length of the features we are interested in: four here, or one for each feature column in our DataFrame. The np.arange() function is an easy way to generate this, but we could nearly as easily input this array manually. Since we don't want the x axis to display this as 0 through 3, we call the .set_xticklabels() function and pass in the column names we wish to display. To line up the x labels properly, we also need to adjust the spacing of the labels. This is why we set the xticks to x plus half the size of the bar_width, which we also set earlier at 0.8. (This offset assumes bars drawn from their left edge; matplotlib 2.0 and later center bars on x by default, in which case the ticks can be set to x directly.) The y values come from taking the mean of each feature for each species. We then plot each by calling .bar(). It is important to note that we pass in a bottom parameter for the second and third series, which sets the minimum y point of that series to the maximum y point of the series below it. This creates the stacked bars. And finally, we add a legend, which describes each series. The names are inserted into the legend list in order of the placement of the bars from top to bottom.

The seaborn library

The next visualization library we'll look at is called seaborn. It is a library that was created specifically for statistical visualizations. In fact, it is perfect for use with pandas DataFrames, where the columns are features and the rows are observations. This style of DataFrame is called tidy data, and is the most common form for machine learning applications.

Let's now take a look at the power of seaborn:

import seaborn as sns 
sns.pairplot(df, hue='species') 

With just those two lines of code, we get the following:

Seaborn plot

Having just detailed the intricate nuances of matplotlib, you will immediately appreciate the simplicity with which we generated this plot. All of our features have been plotted against each other and properly labeled with just two lines of code. You might wonder if I just wasted dozens of pages teaching you matplotlib when seaborn makes these types of visualizations so simple. Well, that isn't the case, as seaborn is built on top of matplotlib. In fact, you can use all of what you learned about matplotlib to modify and work with seaborn. Let's take a look at another visualization:

fig, ax = plt.subplots(2, 2, figsize=(7, 7)) 
sns.set(style='white', palette='muted')
sns.violinplot(x=df['species'], y=df['sepal length (cm)'], ax=ax[0,0])
sns.violinplot(x=df['species'], y=df['sepal width (cm)'], ax=ax[0,1])
sns.violinplot(x=df['species'], y=df['petal length (cm)'], ax=ax[1,0])
sns.violinplot(x=df['species'], y=df['petal width (cm)'], ax=ax[1,1])
fig.suptitle('Violin Plots', fontsize=16, y=1.03)
for i in ax.flat:
    plt.setp(i.get_xticklabels(), rotation=-90)

The preceding code generates the following output:

Violin Plots

Here, we have generated a violin plot for each of the four features. A violin plot displays the distribution of the features. For example, you can easily see that the petal length of setosa (0) is highly clustered between 1 cm and 2 cm, while virginica (2) is much more dispersed, from nearly 4 cm to over 7 cm. You will also notice that we have used much of the same code we used when constructing the matplotlib graphs. The main difference is the addition of the sns.violinplot() calls, in place of the ax.hist() calls previously. We have also added a title above all of the subplots, rather than over each individually, with the fig.suptitle() function. One other notable addition is the iteration over each of the subplots to change the rotation of the xticklabels. We use ax.flat to iterate over each subplot axis and set a particular property using plt.setp(). This prevents us from having to individually type out ax[0][0] through ax[1][1] and set the properties, as we did previously in the earlier matplotlib subplot code.

There are hundreds of styles of graphs you can generate using matplotlib and seaborn, and I highly recommend digging into the documentation for these two libraries—it will be time well spent—but the graphs I have detailed in the preceding section should go a long way toward helping you to understand the dataset you have, which in turn will help you when building your machine learning models.


We've learned a great deal about inspecting the data we have, but now let's move on to learning how to process and manipulate our data. Here, we will learn about the .map(), .apply(), .applymap(), and .groupby() functions of pandas. These are invaluable for working with data, and are especially useful in the context of machine learning for feature engineering, a concept we will discuss in detail in later chapters.


We'll now begin with the map function. The map function works on series, so in our case we will use it to transform a column of our DataFrame, which you will recall is just a pandas series. Suppose we decide that the species numbers are not suitable for our needs. We'll use the map function with a Python dictionary as the argument to accomplish this. We'll pass in a replacement for each of the unique iris types:

Let's look at what we have done here. We have run the map function over each of the values of the existing species column. As each value was found in the Python dictionary, it was added to the return series. We assigned this return series to the same species name, so it replaced our original species column. Had we chosen a different name, say short code, that column would have been appended to the DataFrame, and we would then have the original species column plus the new short code column.
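A minimal sketch of this use of map, run on a tiny hand-made stand-in for the species column. The three-letter codes are hypothetical replacements chosen for illustration.

```python
import pandas as pd

# A tiny stand-in for the species column of the iris DataFrame.
df = pd.DataFrame({'species': [0, 0, 1, 2]})

# Hypothetical short codes for the three species.
codes = {0: 'SET', 1: 'VER', 2: 'VIR'}

# Each value is looked up in the dictionary; assigning back to the
# same column name replaces the original column.
df['species'] = df['species'].map(codes)
```

Any value that is missing from the dictionary would come back as NaN, so it is worth checking that the dictionary covers every unique value.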

We could have instead passed the map function a series or a function to perform this transformation on a column, but this is a functionality that is also available through the apply function, which we'll take a look at next. The dictionary functionality is unique to the map function, and the most common reason to choose map over apply for a single column transformation. But, let's now take a look at the apply function.


The apply function allows us to work with both DataFrames and series. We'll start with an example that would work equally well with map, before moving on to examples that would only work with apply.

Using our iris DataFrame, let's make a new column based on petal width. We previously saw that the median for the petal width was 1.3. Let's now create a new column in our DataFrame, wide petal, that contains binary values based on the value in the petal width column. If the petal width is equal to or wider than the median, we will code it with a 1, and if it is less than the median, we will code it 0. We'll do this using the apply function on the petal width column:

A few things happened here, so let's walk through them step by step. The first is that we were able to append a new column to the DataFrame simply by using the column selection syntax for a column name, which we want to create, in this case wide petal. We set that new column equal to the output of the apply function. Here, we ran apply on the petal width column that returned the corresponding values in the wide petal column. The apply function works by running through each value of the petal width column. If the value is greater than or equal to 1.3, the function returns 1, otherwise it returns 0. This type of transformation is a fairly common feature engineering transformation in machine learning, so it is good to be familiar with how to perform it.
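The walkthrough above can be sketched as follows, using a few hand-picked petal widths in place of the full dataset. The 1.3 threshold is the value quoted in the text.

```python
import pandas as pd

# A few sample petal widths standing in for the full column.
df = pd.DataFrame({'petal width (cm)': [0.2, 1.3, 1.2, 2.5]})

# apply calls the lambda once per value; assigning to a new
# column name appends that column to the DataFrame.
df['wide petal'] = df['petal width (cm)'].apply(lambda v: 1 if v >= 1.3 else 0)
```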

Let's now take a look at using apply on a DataFrame rather than a single series. We'll now create a feature based on the petal area:

Creating a new feature

Notice that we called apply not on a series here, but on the entire DataFrame, and because apply was called on the entire DataFrame, we passed in axis=1 in order to tell pandas that we want to apply the function row-wise. If we passed in axis=0, then the function would operate column-wise. Here, each row is processed sequentially, and we choose to multiply the values from the petal length (cm) and petal width (cm) columns. The resultant series then becomes the petal area column in our DataFrame. This type of power and flexibility is what makes pandas an indispensable tool for data manipulation.
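A sketch of the row-wise apply, using three hand-made rows in place of the iris data. The second computation is an alternative I am adding for comparison, not something from the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'petal length (cm)': [1.4, 4.7, 6.0],
    'petal width (cm)': [0.2, 1.4, 2.5],
})

# axis=1 hands each row to the lambda as a Series keyed by column name.
df['petal area'] = df.apply(
    lambda r: r['petal length (cm)'] * r['petal width (cm)'], axis=1)

# For simple arithmetic, multiplying the columns directly gives the
# same values and avoids a Python-level call per row.
vectorized = df['petal length (cm)'] * df['petal width (cm)']
```

The row-wise form earns its keep when the function needs branching logic that does not vectorize easily.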


We've looked at manipulating columns and explained how to work with rows, but suppose you'd like to perform a function across all data cells in your DataFrame. This is where applymap is the correct tool. Let's take a look at an example:

Using applymap function

Here, we called applymap on our DataFrame in order to get the log of every value (np.log() utilizes the NumPy library to return this value), if that value is of the float type. This type checking prevents returning an error or a float for the species or wide petal columns, which are string and integer values respectively. Common uses of applymap include transforming or formatting each cell based on meeting a number of conditional criteria.
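Here is a sketch of that element-wise transformation on a small hand-made DataFrame with float, string, and integer columns. One assumption to flag: pandas 2.1 deprecated DataFrame.applymap in favor of DataFrame.map, so the code picks whichever is available. The helper name log_if_float is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'petal length (cm)': [1.0, 4.7],   # floats: will be logged
    'species': ['SET', 'VER'],         # strings: left untouched
    'wide petal': [0, 1],              # integers: left untouched
})

def log_if_float(value):
    """Return the natural log of floats; pass everything else through."""
    return np.log(value) if isinstance(value, float) else value

# Use DataFrame.map where it exists (pandas 2.1+), else fall back to applymap.
elementwise = df.map if hasattr(df, 'map') else df.applymap
logged = elementwise(log_if_float)
```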


Let's now look at an operation that is highly useful, but often difficult for new pandas users to get their heads around: the .groupby() function. We'll walk through a number of examples step by step in order to illustrate the most important functionality.

The groupby operation does exactly what it says: it groups data based on some class or classes you choose. Let's take a look at a simple example using our iris dataset. We'll go back and reimport our original iris dataset, and run our first groupby operation:

Here, data for each species is partitioned and the mean for each feature is provided. Let's take it a step further now and get full descriptive statistics for each species:

Statistics for each species

And now, we can see the full breakdown bucketed by species. Let's now look at some other groupby operations we can perform. We saw previously that petal length and width had some relatively clear boundaries between species. Now, let's examine how we might use groupby to see that:
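The two groupby calls discussed so far can be sketched like this, assuming the DataFrame is rebuilt from load_iris() with a numeric species column:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

by_species = df.groupby('species')

# One row per species, one column per feature.
means = by_species.mean()

# Full descriptive statistics: eight statistics for each of the four features.
stats = by_species.describe()
```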

In this case, we have grouped each unique species by the petal width they were associated with. This is a manageable number of measurements to group by, but if it were to become much larger, we would likely need to partition the measurements into brackets. As we saw previously, that can be accomplished by means of the apply function.
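A sketch of grouping by a measurement instead of by species, under the same load_iris() assumption (the name species_by_width is illustrative):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# For every distinct petal width, list the species codes seen at that width.
species_by_width = df.groupby('petal width (cm)')['species'].unique()

narrowest = species_by_width.loc[0.2]  # species observed at 0.2 cm
widest = species_by_width.loc[2.5]     # species observed at 2.5 cm
```

Widths near either extreme map to a single species, while some mid-range widths are shared by two.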

Let's now take a look at a custom aggregation function:

In this code, we grouped petal width by species and applied the .max() and .min() functions, along with a lambda function that returns the maximum petal width minus the minimum petal width.
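One way to write such an aggregation is with named aggregation, sketched below under the load_iris() assumption. The output column names (max_width, min_width, width_range) are my own choices, and the original notebook may have passed the functions differently.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Each keyword becomes an output column; the value is either the name
# of a built-in aggregation or any function that reduces a Series.
width_stats = df.groupby('species')['petal width (cm)'].agg(
    max_width='max',
    min_width='min',
    width_range=lambda s: s.max() - s.min(),
)
```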

We've only just touched on the functionality of the groupby function; there is a lot more to learn, so I encourage you to read the documentation.

Hopefully, you now have a solid base-level understanding of how to manipulate and prepare data in preparation for our next step, which is modeling. We will now move on to discuss the primary libraries in the Python machine learning ecosystem.

Modeling and evaluation

In this section, we will go through libraries such as statsmodels and scikit-learn, and also look at what deployment involves.


The first library we'll cover is the statsmodels library. Statsmodels is a Python package that is well documented and developed for exploring data, estimating models, and running statistical tests. Let's use it here to build a simple linear regression model of the relationship between sepal length and sepal width for the setosa species.

First, let's visually inspect the relationship with a scatterplot:

fig, ax = plt.subplots(figsize=(7,7)) 
ax.scatter(df['sepal width (cm)'][:50], df['sepal length (cm)'][:50]) 
ax.set_ylabel('Sepal Length') 
ax.set_xlabel('Sepal Width') 
ax.set_title('Setosa Sepal Width vs. Sepal Length', fontsize=14, y=1.02) 

The preceding code generates the following output:

So, we can see that there appears to be a positive linear relationship; that is, as the sepal width increases, the sepal length does as well. We'll next run a linear regression on the data using statsmodels to estimate the strength of that relationship:

import statsmodels.api as sm 
y = df['sepal length'][:50] 
x = df['sepal width'][:50] 
X = sm.add_constant(x) 
results = sm.OLS(y, X).fit() 
print results.summary() 

The preceding code generates the following output:

In the preceding diagram, we have the results of our simple regression model. Since this is a linear regression, the model takes the format of Y = β0 + β1X, where β0 is the intercept and β1 is the regression coefficient. Here, the formula would be Sepal Length = 2.6447 + 0.6909 * Sepal Width. We can also see that the R² for the model is a respectable 0.558, and the p-value (Prob) is highly significant, at least for this species.

Let's now use the results object to plot our regression line:

fig, ax = plt.subplots(figsize=(7,7)) 
ax.plot(x, results.fittedvalues, label='regression line') 
ax.scatter(x, y, label='data point', color='r') 
ax.set_ylabel('Sepal Length') 
ax.set_xlabel('Sepal Width') 
ax.set_title('Setosa Sepal Width vs. Sepal Length', fontsize=14, y=1.02) 

The preceding code generates the following output:

By plotting results.fittedvalues, we can get the resulting regression line from our regression.

There are a number of other statistical functions and tests in the statsmodels package, and I invite you to explore them. It is an exceptionally useful package for standard statistical modeling in Python. Let's now move on to the king of Python machine learning packages: scikit-learn.


Scikit-learn is an amazing Python library with unrivaled documentation, designed to provide a consistent API to dozens of algorithms. It is built upon, and is itself, a core component of the Python scientific stack, which includes NumPy, SciPy, pandas, and matplotlib. Here are some of the areas scikit-learn covers: classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

We'll look at a few examples. First, we will build a classifier using our iris data, and then we'll look at how we can evaluate our model using the tools of scikit-learn:

  1. The first step to building a machine learning model in scikit-learn is understanding how the data must be structured.
  2. The independent variables should be a numeric n × m matrix, X, and the dependent variable, y, an n × 1 vector.
  3. The y vector may be numeric (continuous or categorical) or a string categorical.
  4. These are then passed into the .fit() method on the chosen classifier.
  5. This is the great benefit of using scikit-learn: each classifier utilizes the same methods to the extent possible. This makes swapping them in and out a breeze.

Let's see this in action in our first example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
clf = RandomForestClassifier(max_depth=5, n_estimators=10)
X = df.iloc[:, :4]  # the four feature columns (.ix has been removed from pandas)
y = df.iloc[:, 4]   # the species column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
rf = pd.DataFrame(list(zip(y_pred, y_test)), columns=['predicted', 'actual'])
rf['correct'] = rf.apply(lambda r: 1 if r['predicted'] == r['actual'] else 0, axis=1)

The preceding code generates the following output:

Now, let's execute the following line of code:

rf['correct'].sum() / rf['correct'].count()

The preceding code generates the following output:

In the preceding few lines of code, we built, trained, and tested a classifier that has a 95% accuracy level on our iris dataset. Let's unpack each of the steps. Up at the top, we made a couple of imports; the first two are from scikit-learn, which thankfully is shortened to sklearn in import statements. The first import is a random forest classifier, and the second is a function for splitting your data into training and testing cohorts. This data partitioning is critical in building machine learning applications for a number of reasons. We'll get into this in later chapters, but suffice to say at this point it is a must. The train_test_split function also shuffles your data, which again is important as the order can contain information that would bias your actual predictions.

The first curious-looking line after the imports instantiates our classifier, in this case a random forest classifier. We select a forest that uses 10 decision trees, and each tree is allowed a maximum split depth of five. This is put in place to avoid overfitting, something we will discuss in depth in later chapters.

The next two lines create our X matrix and y vector. If you remember our original iris DataFrame, it contained four features: petal width and length, and sepal width and length. These features are selected and become our independent feature matrix, X. The last column, the iris class names, then becomes our dependent y vector.

These are then passed into the train_test_split function, which shuffles and partitions our data into four subsets: X_train, X_test, y_train, and y_test. The test_size parameter is set to .3, which means 30% of our dataset will be allocated to the X_test and y_test partitions, while the remaining 70% will be allocated to the training partitions, X_train and y_train.
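The effect of test_size is easy to check by looking at the shapes of the four subsets. The following sketch again assumes scikit-learn's bundled iris data (150 rows) in place of our DataFrame, and passes random_state=0, which is an addition here, so that the shuffle is the same on every run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled 150-row iris data stands in for the book's df.
iris = load_iris()
X, y = iris.data, iris.target

# random_state is an added assumption; it fixes the shuffle for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, y_train.shape)  # training partitions
print(X_test.shape, y_test.shape)    # testing partitions
```

With 150 rows and test_size=0.3, 45 rows land in the test partitions and 105 in the training partitions.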

Next, our model is fitted using the training data. Having trained the model, we then call the predict method on our classifier using our test data. Remember, the test data is data the classifier has not seen. The predict call returns an array of predicted labels. We then create a DataFrame of the predicted labels versus the actual labels. Finally, we total the correct predictions and divide by the total number of instances, which, as we can see, gave us a very accurate prediction. Let's now see which features gave us the most discriminative or predictive power:

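f_importances = clf.feature_importances_
f_names = df.columns[:4]
# Spread of each feature's importance across the individual trees
f_std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)

# Sort features from most to least important
zz = zip(f_importances, f_names, f_std)
zzs = sorted(zz, key=lambda x: x[0], reverse=True)

imps = [x[0] for x in zzs]
labels = [x[1] for x in zzs]
errs = [x[2] for x in zzs]

# Bar chart of importances with the standard deviation as error bars
plt.bar(range(len(f_importances)), imps, color="r", yerr=errs, align="center")
plt.xticks(range(len(f_importances)), labels);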

The preceding code generates the following output:

As we expected, based upon our earlier visual analysis, the petal length and width have more discriminative power when differentiating between the iris classes. Where exactly did these numbers come from, though? The random forest has an attribute called feature_importances_ that holds the relative contribution of each feature to the splits made in the trees. If a feature is able to consistently and cleanly split a group into distinct classes, it will have a high feature importance. The importances always sum to one. As you will notice here, we have included the standard deviation, which helps to illustrate how consistent each feature is. This is generated by taking the feature importance of each feature in each of the ten trees and calculating the standard deviation across the trees.
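The two claims above, that the importances sum to one and that the standard deviation is taken across the ten trees, can be checked with a short sketch. It assumes scikit-learn's bundled iris data in place of our DataFrame and adds random_state=0 for repeatability; the per_tree name is only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: the bundled iris data stands in for the book's df.
iris = load_iris()
clf = RandomForestClassifier(max_depth=5, n_estimators=10, random_state=0)
clf.fit(iris.data, iris.target)

# One row per tree, one column per feature (illustrative name).
per_tree = np.array([tree.feature_importances_ for tree in clf.estimators_])

print(per_tree.shape)                   # (number of trees, number of features)
print(clf.feature_importances_.sum())   # the forest's importances sum to one
print(per_tree.std(axis=0))             # per-feature spread across the trees
```

A small standard deviation for a feature means the trees largely agree on how useful it is; a large one means its importance depends heavily on which rows and features each tree happened to see.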

Let's now take a look at one more example using scikit-learn. We will now switch out our classifier and use a support vector machine (SVM):

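from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# A linear SVM wrapped in a one-vs-rest scheme for the three iris classes
clf = OneVsRestClassifier(SVC(kernel='linear'))

X = df.iloc[:, :4]                         # the four feature columns
y = np.array(df.iloc[:, 4]).astype(str)    # class labels as plain strings

# Shuffle and split the data, holding out 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Compare predicted and actual labels row by row
rf = pd.DataFrame(list(zip(y_pred, y_test)), columns=['predicted', 'actual'])
rf['correct'] = rf.apply(lambda r: 1 if r['predicted'] == r['actual'] else 0, axis=1)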

The preceding code generates the following output:

Now, let's execute the following line of code:

rf['correct'].sum() / rf['correct'].count()

The preceding code generates the following output:

Here, we have swapped in an SVM while changing almost none of our code. The only changes were the import of the SVM instead of the random forest, and the line that instantiates the classifier. (I did have to make one small change to the format of the y labels, as the SVM wasn't able to interpret them as NumPy strings like the random forest classifier was. Sometimes, these data type conversions have to be made explicit or an error will result, but it's a minor annoyance.)
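Because both classifiers expose the same fit and predict methods, the swap can even be written as a loop. The sketch below is one way to do that, not the approach used in the examples above: it assumes scikit-learn's bundled iris data in place of our DataFrame, adds random_state=0 for repeatability, and uses the score method (mean accuracy on the given data) as a shortcut for the manual accuracy calculation. The classifiers and scores dictionaries are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for the book's df.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Illustrative mapping of names to unfitted classifiers.
classifiers = {
    'random forest': RandomForestClassifier(
        max_depth=5, n_estimators=10, random_state=0
    ),
    'linear SVM': OneVsRestClassifier(SVC(kernel='linear')),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                  # same call for every classifier
    scores[name] = clf.score(X_test, y_test)   # mean accuracy on the test set
    print(name, round(scores[name], 3))
```

Nothing inside the loop knows which algorithm it is running, which is exactly the uniformity that makes scikit-learn classifiers so easy to compare.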

This is only a small sample of the functionality of scikit-learn, but it should give you a hint of the power of this magnificent tool for machine learning applications. There are a number of additional machine learning libraries that we won't have a chance to discuss here but will explore in later chapters. Still, if this is your first time utilizing a machine learning library and you want a strong general-purpose tool, I strongly suggest scikit-learn as your go-to choice.


There are a number of options you can choose from when you decide to put your machine learning model into production. It depends substantially on the nature of the application. Deployment could include anything from a cron job run on your local machine to a full-scale implementation deployed on an Amazon EC2 instance.

We won't go into detail regarding specific implementations here, but we will have a chance to delve into different deployment examples throughout the book.