scikit-learn Cookbook - Second Edition

By: Trent Hauck

Overview of this book

Python is quickly becoming the go-to language for analysts and data scientists due to its simplicity and flexibility, and within the Python data space, scikit-learn is the unequivocal choice for machine learning. This book includes walkthroughs and solutions to common as well as not-so-common problems in machine learning, showing how scikit-learn can be leveraged to perform various machine learning tasks effectively. The second edition begins by taking you through recipes on evaluating the statistical properties of data and generating synthetic data for machine learning modelling. As you progress through the chapters, you will come across recipes that will teach you to implement techniques like data pre-processing, linear regression, logistic regression, k-NN, Naïve Bayes, classification, decision trees, ensembles, and much more. Furthermore, you'll learn to optimize your models with multi-class classification, cross-validation, and model evaluation, and dive deeper into implementing deep learning with scikit-learn. Along with covering the enhanced features of model selection, the API, and new features such as classifiers, regressors, and estimators, the book also contains recipes on evaluating and fine-tuning the performance of your model. By the end of this book, you will have explored a plethora of features offered by scikit-learn for Python to solve any machine learning problem you come across.

Viewing the iris dataset with Pandas

In this recipe, we will use the handy pandas data analysis library to view and visualize the iris dataset. It introduces the notion of a dataframe, which might be familiar to you if you use R's data.frame.

How to do it...

You can view the iris dataset with Pandas, a library built on top of NumPy:

  1. Create a dataframe with the observation variables iris.data and the feature names iris.feature_names (passed as the columns argument):
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()   # load the iris dataset, as in the previous recipe
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

The dataframe is more user-friendly than the NumPy array.

  2. Look at a quick histogram of the values in the dataframe for sepal length:
iris_df['sepal length (cm)'].hist(bins=30)
  3. You can also color the histogram by the target variable:
import numpy as np
import matplotlib.pyplot as plt

for class_number in np.unique(iris.target):
    plt.figure(1)
    iris_df['sepal length (cm)'].iloc[np.where(iris.target == class_number)[0]].hist(bins=30)
  4. Here, we iterate through the target numbers for each flower and draw a colored histogram for each. Consider this line:
np.where(iris.target == class_number)[0]

It finds the NumPy index location for each class of flower:
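To see what this expression returns, here is a minimal sketch (the class value 0 is chosen just for illustration):

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# np.where returns a tuple of index arrays; [0] selects the array itself
indices = np.where(iris.target == 0)[0]
print(indices)       # the row positions of the first flower class
print(len(indices))  # 50 -- the iris dataset has 50 samples per class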

Observe that the histograms overlap. This encourages us to model the three histograms as three normal distributions. We can do this in a machine learning manner if we fit the three normal distributions to the training data only, not the whole set. Then we use the test set to evaluate the three normal distribution models we just made up, measuring the accuracy of our predictions on it.
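As a rough sketch of that plan (the per-class normal model and the variable names here are illustrative, not part of scikit-learn's API; we restrict ourselves to the sepal length column used in the histograms):

import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
sepal_length = iris.data[:, 0]  # the column plotted in the histograms
X_train, X_test, y_train, y_test = train_test_split(
    sepal_length, iris.target, test_size=0.25, random_state=0)

# Fit one normal distribution per class, using the training data only
classes = np.unique(y_train)
params = [(X_train[y_train == c].mean(), X_train[y_train == c].std())
          for c in classes]

# Predict the class whose fitted density is highest at each test point
densities = np.column_stack([norm.pdf(X_test, loc=m, scale=s)
                             for m, s in params])
predictions = classes[densities.argmax(axis=1)]

print((predictions == y_test).mean())  # accuracy on the test set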

How it works...

The dataframe data object is essentially a 2D NumPy array with column names and row names attached. In data science, the fundamental data object looks like a 2D table, possibly because of SQL's long history. NumPy allows for 3D arrays, cubes, 4D arrays, and so on; these also come up often.
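For example, the pieces can be inspected directly (a minimal illustration using standard pandas attributes):

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

print(iris_df.values.shape)  # the underlying 2D NumPy array: (150, 4)
print(iris_df.columns)       # the column names
print(iris_df.index)         # the row names (a default RangeIndex here)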