Machine Learning Fundamentals

By: Hyatt Saleh

Overview of this book

As machine learning algorithms become popular, new tools that optimize these algorithms are also being developed. Machine Learning Fundamentals explains how to use the syntax of scikit-learn. You'll study the difference between supervised and unsupervised models, as well as the importance of choosing the appropriate algorithm for each dataset. You'll apply unsupervised clustering algorithms to real-world datasets to discover patterns and profiles, and explore the process of solving an unsupervised machine learning problem. The focus of the book then shifts to supervised learning algorithms. You'll learn to implement different supervised algorithms and develop neural network structures using the scikit-learn package. You'll also learn how to perform coherent result analysis to improve the performance of the algorithm by tuning hyperparameters. By the end of this book, you will have gained all the skills required to start programming machine learning algorithms.

Chapter 3: Supervised Learning: Key Steps

Activity 8: Data Partition over a Handwritten Digit Dataset

Import the digits toy dataset using scikit-learn's datasets package and create a Pandas DataFrame containing the features and target matrices. Use the following code:
```
from sklearn.datasets import load_digits
digits = load_digits()

import pandas as pd
X = pd.DataFrame(digits.data)
Y = pd.DataFrame(digits.target)
```
The shapes of your features and target matrices should be as follows, respectively:
```
(1797,64) (1797,1)
```
Choose the appropriate approach for splitting the dataset and split it.
Conventional split approach (60/20/20%)
Using the train_test_split function, split the data into an initial train set and a test set:
```
from sklearn.model_selection import train_test_split

X_new, X_test, Y_new, Y_test = train_test_split(X, Y, test_size=0.2)
```
The shape of the sets that you created should be as follows:
```
(1437,64) (360,64) (1437,1) (360,1)
```
Next, calculate the value of the test_size, which sets the size of the dev set equal to the size of the test set that was created previously:
```
dev_size = 360/1437
```
The result of the preceding operation is 0.2505, which can be rounded to 0.25.
Finally, split X_new and Y_new into the final train and dev sets. Use the following code:
```
X_train, X_dev, Y_train, Y_dev = train_test_split(X_new, Y_new, test_size = 0.25)
```
The final shape of all sets is shown here:
```
X_train = (1077,64)
X_dev = (360,64)
X_test = (360,64)
Y_train = (1077,1)
Y_dev = (360,1)
Y_test = (360,1)
```
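The arithmetic behind the two-stage split can be checked with a short sketch. The fixed `random_state` value is an assumption added here for reproducibility and is not part of the original solution. The sketch also passes the instance count as an integer `test_size`, which `train_test_split` accepts, instead of the rounded fraction:
```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import pandas as pd

digits = load_digits()
X = pd.DataFrame(digits.data)
Y = pd.DataFrame(digits.target)

# First split: hold out 20% of all instances as the test set.
# random_state=0 is an assumption for reproducibility.
X_new, X_test, Y_new, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# The dev set should match the test set in size, so express the
# test set size as a fraction of the remaining instances.
dev_size = len(X_test) / len(X_new)

# Second split: an integer test_size requests an exact number of instances.
X_train, X_dev, Y_train, Y_dev = train_test_split(
    X_new, Y_new, test_size=len(X_test), random_state=0
)

print(round(dev_size, 4))
print(X_train.shape, X_dev.shape, X_test.shape)
```
Requesting the dev set by instance count avoids relying on how the fraction is rounded.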
Cross-Validation Approach
Using the train_test_split function, split the data into an initial train set and a test set, just like you did previously:
```
from sklearn.model_selection import train_test_split

X_new_2, X_test_2, Y_new_2, Y_test_2 = train_test_split(X, Y, test_size=0.1)
```
Using the KFold class, perform a 10-fold split:
```
from sklearn.model_selection import KFold

kf = KFold(n_splits = 10)
splits = kf.split(X_new_2)
```
Remember that cross-validation performs different configurations of splits, holding out a different fold as the dev set each time. Considering this, write a for loop that goes through all the split configurations:
```
for train_index, dev_index in splits:
  X_train_2, X_dev_2 = X_new_2.iloc[train_index], X_new_2.iloc[dev_index]

  Y_train_2, Y_dev_2 = Y_new_2.iloc[train_index], Y_new_2.iloc[dev_index]
```
The code in charge of training and evaluating the model should be inside the body of the for loop in order to train and evaluate the model with each configuration of splits.
After the last iteration of the loop, the shape of the sets will be as follows:
```
X_train_2 = (1456,64)
X_dev_2 = (161,64)
X_test_2 = (180,64)
Y_train_2 = (1456,1)
Y_dev_2 = (161,1)
Y_test_2 = (180,1)
```
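As a minimal sketch of what the body of the for loop could contain, the following trains and evaluates one model per split configuration. The choice of a decision tree, the `random_state` values, and the names `fold_scores` and `mean_dev_accuracy` are assumptions made for illustration; the activity itself only creates the splits:
```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

digits = load_digits()
X = pd.DataFrame(digits.data)
Y = pd.DataFrame(digits.target)

# random_state=0 is an assumption for reproducibility.
X_new_2, X_test_2, Y_new_2, Y_test_2 = train_test_split(
    X, Y, test_size=0.1, random_state=0
)

kf = KFold(n_splits=10)
fold_scores = []  # hypothetical name: one dev accuracy per fold
for train_index, dev_index in kf.split(X_new_2):
    X_train_2, X_dev_2 = X_new_2.iloc[train_index], X_new_2.iloc[dev_index]
    Y_train_2, Y_dev_2 = Y_new_2.iloc[train_index], Y_new_2.iloc[dev_index]

    # Train a fresh model on this configuration's train portion.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train_2, Y_train_2.values.ravel())

    # Evaluate it on the fold held out as the dev set.
    pred = model.predict(X_dev_2)
    fold_scores.append(accuracy_score(Y_dev_2.values.ravel(), pred))

# Average over the folds to summarize dev performance.
mean_dev_accuracy = sum(fold_scores) / len(fold_scores)
print(len(fold_scores), X_train_2.shape, X_dev_2.shape)
```
The test set is left untouched inside the loop, so it remains available for a final evaluation.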

Activity 9: Evaluating the Performance of the Model Trained over a Handwritten Dataset

Import the digits toy dataset using scikit-learn's datasets package and create a Pandas DataFrame containing the features and target matrices:
```
from sklearn.datasets import load_digits
data = load_digits()

import pandas as pd
X = pd.DataFrame(data.data)
Y = pd.DataFrame(data.target)
```
Split the data into training and testing sets. Use 20% as the size of the testing set:
```
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
```

Train a decision tree over the train set. Then, use the model to predict the class label over the test set (hint: to train the Decision Tree, revisit Exercise 12):
```
from sklearn import tree
model = tree.DecisionTreeClassifier(random_state = 0)
model = model.fit(X_train, Y_train)

Y_pred = model.predict(X_test)
```
Use scikit-learn to construct a confusion matrix:
```
from sklearn.metrics import confusion_matrix
# Store the result under a new name so the imported function is not overwritten
cm = confusion_matrix(Y_test, Y_pred)
```
The output of the confusion matrix is shown as follows:
Figure 3.13: Output of the confusion matrix from Activity 9

Calculate the accuracy of the model:

```
from sklearn.metrics import accuracy_score
# Store the result under a new name so the imported function is not overwritten
accuracy = accuracy_score(Y_test, Y_pred)
```

The accuracy is equal to 84.72%.
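
The link between the confusion matrix and the accuracy can be made explicit with a small sketch: the diagonal of the matrix counts the correctly classified instances, so their sum divided by the total number of instances should match `accuracy_score`. The use of `load_digits(return_X_y=True)` and the name `accuracy_from_cm` are assumptions chosen to keep the example compact:
```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)

# Diagonal entries count instances whose predicted label equals the true label.
accuracy_from_cm = np.trace(cm) / cm.sum()

print(accuracy_from_cm, accuracy_score(y_test, y_pred))
```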

Calculate the precision and recall. Because scikit-learn's precision_score and recall_score expect binary targets by default, assume that we are only interested in classifying instances as number 6 or any other number:
```
from sklearn.metrics import precision_score, recall_score

# Work on copies so Y_test and Y_pred stay intact: 0 = digit 6, 1 = any other digit
Y_test_2 = Y_test.copy()
Y_test_2[Y_test_2 != 6] = 1
Y_test_2[Y_test_2 == 6] = 0

Y_pred_2 = Y_pred.copy()
Y_pred_2[Y_pred_2 != 6] = 1
Y_pred_2[Y_pred_2 == 6] = 0

# With this encoding, the positive class (1) is "any other digit"
precision = precision_score(Y_test_2, Y_pred_2)
recall = recall_score(Y_test_2, Y_pred_2)
```

The precision and recall scores should be equal to 98.41% and 98.10%, respectively.
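
To show what these two scores measure, the following sketch recomputes them from raw counts of true positives, false positives, and false negatives, using the same encoding as the activity (0 for the digit 6, 1 for any other digit). The variable names and the compact data loading are assumptions made for illustration:
```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
y_pred = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

# Same encoding as the activity: 0 for the digit 6, 1 for any other digit.
y_test_bin = (y_test != 6).astype(int)
y_pred_bin = (y_pred != 6).astype(int)

# The positive class is 1 ("any other digit").
tp = np.sum((y_test_bin == 1) & (y_pred_bin == 1))  # true positives
fp = np.sum((y_test_bin == 0) & (y_pred_bin == 1))  # false positives
fn = np.sum((y_test_bin == 1) & (y_pred_bin == 0))  # false negatives

precision_manual = tp / (tp + fp)  # share of predicted positives that are correct
recall_manual = tp / (tp + fn)     # share of actual positives that were found

print(precision_manual, recall_manual)
```
Both values should agree with the results of `precision_score` and `recall_score` on the same binarized arrays.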

Activity 10: Performing Error Analysis over a Model Trained to Recognize Handwritten Digits

Import the digits toy dataset using scikit-learn's datasets package and create a Pandas DataFrame containing the features and target matrices:

```
from sklearn.datasets import load_digits
data = load_digits()

import pandas as pd
X = pd.DataFrame(data.data)
Y = pd.DataFrame(data.target)
```

Split the data into training, validation, and testing sets. Use 0.1 as the size of the test set, and an equivalent number to build a validation set of the same shape:

```
from sklearn.model_selection import train_test_split

X_new, X_test, Y_new, Y_test = train_test_split(X, Y, test_size = 0.1, random_state = 101)

X_train, X_dev, Y_train, Y_dev = train_test_split(X_new, Y_new, test_size = 0.11, random_state = 101)
```

Create a train/dev set for both the features and the target values that contains 89 instances/labels of the train set and 89 instances/labels of the dev set:

```
import numpy as np
np.random.seed(101)

indices_train = np.random.randint(0, len(X_train), 89)
indices_dev = np.random.randint(0, len(X_dev), 89)

X_train_dev = pd.concat([X_train.iloc[indices_train,:], X_dev.iloc[indices_dev,:]])

Y_train_dev = pd.concat([Y_train.iloc[indices_train,:], Y_dev.iloc[indices_dev,:]])
```

Train a decision tree over the training set data:
```
from sklearn import tree

model = tree.DecisionTreeClassifier(random_state = 101)
model = model.fit(X_train, Y_train)
```

Calculate the error rate for all sets of data (obtained here from the accuracy score, where the error rate equals 1 minus the accuracy), and determine which condition is affecting the performance of the model:
```
from sklearn.metrics import accuracy_score
X_sets = [X_train, X_train_dev, X_dev, X_test]
Y_sets = [Y_train, Y_train_dev, Y_dev, Y_test]

scores = []
for X_set, Y_set in zip(X_sets, Y_sets):
  pred = model.predict(X_set)
  # accuracy_score returns the accuracy; the error rate is 1 - accuracy
  score = accuracy_score(Y_set, pred)
  scores.append(score)
```
The error rates are shown in the following table:
Figure 3.14: Error rates of the Handwritten Digits model
From the preceding error rates, it can be concluded that the model suffers equally from variance and data mismatch.
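
One way to reach such a conclusion is to convert each accuracy into an error rate and compare consecutive sets. The sketch below is a hedged illustration: the labels given to each gap follow a commonly used error-analysis convention and are an assumption here, as are the names `errors` and `gaps`. Its output is not claimed to reproduce the figures in the table:
```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_digits()
X = pd.DataFrame(data.data)
Y = pd.Series(data.target)

X_new, X_test, Y_new, Y_test = train_test_split(X, Y, test_size=0.1, random_state=101)
X_train, X_dev, Y_train, Y_dev = train_test_split(
    X_new, Y_new, test_size=0.11, random_state=101
)

# Train/dev set: 89 sampled instances from each of the train and dev sets.
rng = np.random.RandomState(101)
idx_train = rng.randint(0, len(X_train), 89)
idx_dev = rng.randint(0, len(X_dev), 89)
X_train_dev = pd.concat([X_train.iloc[idx_train], X_dev.iloc[idx_dev]])
Y_train_dev = pd.concat([Y_train.iloc[idx_train], Y_dev.iloc[idx_dev]])

model = DecisionTreeClassifier(random_state=101).fit(X_train, Y_train)

sets = {
    "train": (X_train, Y_train),
    "train_dev": (X_train_dev, Y_train_dev),
    "dev": (X_dev, Y_dev),
    "test": (X_test, Y_test),
}
# Error rate = 1 - accuracy for each set.
errors = {name: 1 - accuracy_score(y, model.predict(x)) for name, (x, y) in sets.items()}

# Gaps between consecutive sets; the labels are an assumed convention.
gaps = {
    "variance": errors["train_dev"] - errors["train"],
    "data mismatch": errors["dev"] - errors["train_dev"],
    "overfitting to dev": errors["test"] - errors["dev"],
}
for name, gap in gaps.items():
    print(f"{name}: {gap:.4f}")
```
The largest gaps point to the conditions that most affect the model.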
