Machine Learning Fundamentals

By: Hyatt Saleh

Overview of this book

As machine learning algorithms become popular, new tools that optimize these algorithms are also developed. Machine Learning Fundamentals explains how to use the syntax of scikit-learn. You'll study the difference between supervised and unsupervised models, as well as the importance of choosing the appropriate algorithm for each dataset. You'll apply unsupervised clustering algorithms to real-world datasets to discover patterns and profiles, and explore the process of solving an unsupervised machine learning problem. The focus of the book then shifts to supervised learning algorithms. You'll learn to implement different supervised algorithms and develop neural network structures using the scikit-learn package. You'll also learn how to perform coherent result analysis to improve the performance of the algorithm by tuning hyperparameters. By the end of this book, you will have gained all the skills required to start programming machine learning algorithms.

Chapter 2: Unsupervised Learning: Real-life Applications

Activity 3: Using Data Visualization to Aid the Preprocessing Process

Load the previously downloaded dataset by using the Pandas function read_csv(). Store the dataset in a Pandas DataFrame named data:
```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)
```
First, import the required libraries. Then, pass the dataset path to the Pandas function read_csv():
```
data = pd.read_csv("datasets/wholesale_customers_data.csv")
```
Check for missing values in your DataFrame. Using the isnull() function plus the sum() function, count the missing values of the entire dataset at once:
```
data.isnull().sum()
```
Figure 2.16: A screenshot showing the number of missing values in the DataFrame
As you can see from the preceding screenshot, there are no missing values in the dataset.
Check for outliers in your DataFrame. Using the technique you learned in the previous chapter, label those values that fall outside of three standard deviations from the mean as outliers. The following code snippet allows you to look for outliers in the entire set of features at once. However, another valid method would be to check for outliers one feature at a time:
```
outliers = {}
for column in data.columns:
  # Thresholds at three standard deviations below and above the mean
  min_t = data[column].mean() - (3 * data[column].std())
  max_t = data[column].mean() + (3 * data[column].std())
  # Count the values that fall outside the thresholds
  count = 0
  for value in data[column]:
    if value < min_t or value > max_t:
      count += 1
  # Store [number of outliers, number of non-outliers] for the feature
  outliers[column] = [count, data.shape[0] - count]
print(outliers)
```
The count of outliers for each of the features is shown in the following figure:
Figure 2.17: A screenshot showing the output of the preceding code snippet
As you can see from the preceding screenshot, some features do have outliers. Considering that there are only a few outliers for each feature, there are two possible ways to handle them.
First, you could decide to delete the outliers. This decision can be supported by displaying a histogram for the features with outliers:
```
plt.hist(data["Fresh"])
plt.show()
```
Figure 2.18: An example histogram plot for the "Fresh" feature
For instance, the histogram for the feature named Fresh shows that most instances have values below 40,000. Hence, deleting the few instances above that value is unlikely to affect the performance of the model.
On the other hand, the second approach would be to leave the outliers as they are, considering that they do not represent a large portion of the dataset, which can be supported with data visualization tools using a pie chart. See the code and the output that follow:
```
plt.figure(figsize=(8,8))
plt.pie(outliers["Detergents_Paper"],autopct="%.2f")
plt.show()
```
Figure 2.19: A pie chart showing the proportion of outliers from the Detergents_Paper feature in the dataset
The preceding diagram shows the proportion of outliers from the Detergents_Paper feature, which was the feature with the most outliers in the dataset. Only 2.27% of the values are outliers, a share so low that it is unlikely to affect the performance of the model either.
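The three-standard-deviation rule and the outlier share discussed in this activity can be illustrated on a small list of numbers. The following is a minimal sketch using only the Python standard library; the function name count_outliers and the toy values are assumptions made for illustration and are not part of the Wholesale Customers dataset:

```python
from statistics import mean, stdev

def count_outliers(values, n_std=3):
    """Return (outliers, non_outliers) using the mean +/- n_std * std rule."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, like the pandas default
    lower = mu - n_std * sigma
    upper = mu + n_std * sigma
    n_outliers = sum(1 for v in values if v < lower or v > upper)
    return n_outliers, len(values) - n_outliers

# Hypothetical feature: twenty ordinary values plus one extreme value
toy_feature = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11] * 2 + [60]
n_outliers, n_regular = count_outliers(toy_feature)

# Share of outliers, the same quantity that the pie chart displays
share = 100 * n_outliers / len(toy_feature)
print(n_outliers, n_regular, round(share, 2))
```

The pair returned by the function has the same layout as the lists stored in the outliers dictionary, which is why those lists can be handed directly to a pie chart.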
Rescale the data. For this solution, the formula for standardization has been used. Note that the formula can be applied to the entire dataset at once, instead of being applied individually to each feature:
```
data_standardized = (data - data.mean())/data.std()
data_standardized.head()
```
Figure 2.20: A table showing the first five instances of the standardized dataset
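To make the standardization formula concrete, here is a minimal sketch that applies it to a single list of values using only the Python standard library. The function name standardize and the sample values are assumptions made for illustration. Note that the pandas std() method divides by n - 1 by default, which matches statistics.stdev, whereas scikit-learn's StandardScaler divides by n, so the two approaches give slightly different results:

```python
from statistics import mean, stdev

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # divides by n - 1, as pandas does by default
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
feature_standardized = standardize(feature)

# After standardization the values are centred on 0 with unit spread
print(round(mean(feature_standardized), 6), round(stdev(feature_standardized), 6))
```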

Activity 4: Applying the k-means Algorithm to a Dataset

Open the Jupyter Notebook that you used for the previous activity. There, you should have imported all the required libraries and stored the dataset in a variable named data. The standardized data should look as follows:
```
data_standardized = (data - data.mean())/data.std()
data_standardized.head()
```
Figure 2.21: A screenshot displaying the first five instances of the standardized dataset
Calculate the average distance of data points from their centroid in relation to the number of clusters. Based on this distance, select the appropriate number of clusters to train the model with.
First, import the algorithm class:
```
from sklearn.cluster import KMeans
```
Next, using the code in the following snippet, measure the distance of the data points from their centroid for each number of clusters. The inertia_ attribute used here holds the sum of squared distances of the samples to their closest centroid:
```
ideal_k = []
for i in range(1,21):
  est_kmeans = KMeans(n_clusters=i)
  est_kmeans.fit(data_standardized)
  # Store the number of clusters together with the resulting inertia
  ideal_k.append([i,est_kmeans.inertia_])
ideal_k = np.array(ideal_k)
```
Finally, plot the relation to find the breaking point of the line, and select the number of clusters:
```
plt.plot(ideal_k[:,0],ideal_k[:,1])
plt.show()
```
Figure 2.22: The output of the plot function used
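The quantity plotted on the vertical axis can be reproduced by hand on a toy example. The sketch below is an illustration under assumptions: the function name inertia, the points, and the fixed centroids are hypothetical, and no k-means training takes place. It only shows how the sum of squared distances to the closest centroid drops when a second, well-placed centroid is added, which is the effect the elbow plot relies on:

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    total = 0.0
    for x, y in points:
        total += min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids)
    return total

# Two tight groups of hypothetical 2-D points
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# One centroid in the middle versus one centroid per group
inertia_one = inertia(points, [(5.0, 1.0)])
inertia_two = inertia(points, [(0.0, 1.0), (10.0, 1.0)])
print(inertia_one, inertia_two)
```

Adding further centroids to this toy data would reduce the value only marginally, which is what the flattening of the curve after the breaking point represents.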
Train the model and assign a cluster to each data point in your dataset. Plot the results.
To train the model, use the following code:
```
est_kmeans = KMeans(n_clusters=6)
est_kmeans.fit(data_standardized)
pred_kmeans = est_kmeans.predict(data_standardized)
```
The number of clusters selected is 6; however, since there is no exact breaking point, values between 5 and 10 are also acceptable.
Finally, plot the results of the clustering process. As the dataset contains eight different features, choose two features to draw at once, as shown in the following code:
```
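# Create two side-by-side plots that share the y axis (Milk)
fig, (ax_grocery, ax_frozen) = plt.subplots(1, 2, sharey=True, figsize=(16,8))
# Left plot: Grocery (column 4) against Milk (column 3)
ax_grocery.scatter(data.iloc[:,4], data.iloc[:,3], c=pred_kmeans, s=20)
ax_grocery.set_xlim([0, 20000])
ax_grocery.set_ylim([0, 20000])
ax_grocery.set_xlabel('Grocery')
ax_grocery.set_ylabel('Milk')
# Right plot: Frozen (column 5) against Milk (column 3)
ax_frozen.scatter(data.iloc[:,5], data.iloc[:,3], c=pred_kmeans, s=20)
ax_frozen.set_xlim([0, 20000])
ax_frozen.set_xlabel('Frozen')
plt.show()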
```
Figure 2.23: Two example plots obtained after the clustering process
The subplots() function from Matplotlib has been used to plot two scatter graphs at a time.
As can be seen from the plots, there is no obvious visual separation between the clusters, because only two of the eight features present in the dataset can be drawn at once. However, the final output of the model creates six different clusters that represent six different profiles of clients.

Activity 5: Applying the Mean-Shift Algorithm to a Dataset

Open the Jupyter Notebook that you used for the previous activity.
Train the model and assign a cluster to each data point in your dataset. Plot the results.
First, do not forget to import the algorithm class:
```
from sklearn.cluster import MeanShift
```
To train the model, use the following code:
```
est_meanshift = MeanShift(bandwidth=0.4)
est_meanshift.fit(data_standardized)
pred_meanshift = est_meanshift.predict(data_standardized)
```
The model was trained using a bandwidth of 0.4. However, feel free to test other values to see how the result changes.
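To build some intuition for how the bandwidth influences the outcome, the following is a simplified sketch of mean shift with a flat kernel on one-dimensional data, written with the standard library only. It is not the scikit-learn implementation: the function name mean_shift_1d, the merging rule, and the toy values are assumptions made for illustration:

```python
def mean_shift_1d(values, bandwidth, max_iter=100, tol=1e-6):
    """Toy 1-D mean shift with a flat kernel; returns (labels, centers)."""
    modes = []
    for start in values:
        position = start
        for _ in range(max_iter):
            # Flat kernel: every value within the bandwidth counts equally
            neighbours = [v for v in values if abs(v - position) <= bandwidth]
            shifted = sum(neighbours) / len(neighbours)
            if abs(shifted - position) < tol:
                break
            position = shifted
        modes.append(position)
    # Merge the modes that ended up within one bandwidth of each other
    centers, labels = [], []
    for mode in modes:
        for index, center in enumerate(centers):
            if abs(mode - center) <= bandwidth:
                labels.append(index)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return labels, centers

# Hypothetical values forming three visible groups
values = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.3]
narrow_labels, narrow_centers = mean_shift_1d(values, bandwidth=1.0)
wide_labels, wide_centers = mean_shift_1d(values, bandwidth=5.0)
print(len(narrow_centers), len(wide_centers))
```

On this toy data, the narrow bandwidth keeps the three groups apart, while the wide bandwidth merges everything into a single cluster. The same trade-off applies when testing other bandwidth values on the standardized dataset.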
Finally, plot the results of the clustering process. As the dataset contains eight different features, choose two features to draw at once, as shown in the snippet below. Similar to the previous activity, the separation between clusters is not visible, because only two of the eight features can be drawn at once:
```
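# Create two side-by-side plots that share the y axis (Milk)
fig, (ax_grocery, ax_frozen) = plt.subplots(1, 2, sharey=True, figsize=(16,8))
# Left plot: Grocery (column 4) against Milk (column 3)
ax_grocery.scatter(data.iloc[:,4], data.iloc[:,3], c=pred_meanshift, s=20)
ax_grocery.set_xlim([0, 20000])
ax_grocery.set_ylim([0, 20000])
ax_grocery.set_xlabel('Grocery')
ax_grocery.set_ylabel('Milk')
# Right plot: Frozen (column 5) against Milk (column 3)
ax_frozen.scatter(data.iloc[:,5], data.iloc[:,3], c=pred_meanshift, s=20)
ax_frozen.set_xlim([0, 20000])
ax_frozen.set_xlabel('Frozen')
plt.show()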
```
Figure 2.24: Example plots obtained at the end of the process

Activity 6: Applying the DBSCAN Algorithm to the Dataset

Open the Jupyter Notebook that you used for the previous activity.
Train the model and assign a cluster to each data point in your dataset. Plot the results.
First, do not forget to import the algorithm class:
```
from sklearn.cluster import DBSCAN
```
To train the model, use the following code:
```
est_dbscan = DBSCAN(eps=0.8)
pred_dbscan = est_dbscan.fit_predict(data_standardized)
```
The model was trained using an epsilon value of 0.8. However, feel free to test other values to see how the results change.
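The effect of the epsilon value can be explored with a small from-scratch sketch of the DBSCAN idea on one-dimensional data. This is an illustration under assumptions, not the scikit-learn implementation: the function name dbscan_1d and the toy values are hypothetical, and min_samples is set to 3 here, whereas the scikit-learn default is 5. As in scikit-learn, points that belong to no cluster are labeled -1:

```python
def dbscan_1d(values, eps, min_samples):
    """Toy 1-D DBSCAN; returns one label per value, with -1 marking noise."""
    noise = -1
    labels = [None] * len(values)

    def region(i):
        # Indices of all values within eps of values[i] (including itself)
        return [j for j, v in enumerate(values) if abs(v - values[i]) <= eps]

    cluster = -1
    for i in range(len(values)):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_samples:
            labels[i] = noise  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region(j)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # core point: keep expanding
    return labels

# Hypothetical values: two dense groups and one isolated value
values = [1.0, 1.1, 1.2, 1.3, 5.0, 5.1, 5.2, 9.0]
tight_labels = dbscan_1d(values, eps=0.25, min_samples=3)
loose_labels = dbscan_1d(values, eps=5.0, min_samples=3)
print(tight_labels)
print(loose_labels)
```

With the small epsilon, the isolated value is reported as noise and the two dense groups remain separate. With the large epsilon, every value is absorbed into one cluster.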
Finally, plot the results of the clustering process. As the dataset contains eight different features, choose two features to draw at once, as shown in the following code:
```
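# Create two side-by-side plots that share the y axis (Milk)
fig, (ax_grocery, ax_frozen) = plt.subplots(1, 2, sharey=True, figsize=(16,8))
# Left plot: Grocery (column 4) against Milk (column 3)
ax_grocery.scatter(data.iloc[:,4], data.iloc[:,3], c=pred_dbscan, s=20)
ax_grocery.set_xlim([0, 20000])
ax_grocery.set_ylim([0, 20000])
ax_grocery.set_xlabel('Grocery')
ax_grocery.set_ylabel('Milk')
# Right plot: Frozen (column 5) against Milk (column 3)
ax_frozen.scatter(data.iloc[:,5], data.iloc[:,3], c=pred_dbscan, s=20)
ax_frozen.set_xlim([0, 20000])
ax_frozen.set_xlabel('Frozen')
plt.show()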
```
Figure 2.25: Example plots obtained at the end of the clustering process
Similar to the previous activity, the separation between clusters is not visible, because only two of the eight features can be drawn at once.

Activity 7: Measuring and Comparing the Performance of the Algorithms

Open the Jupyter Notebook that you used for the previous activity.
Calculate both the Silhouette Coefficient score and the Calinski–Harabasz index for all the models that you trained previously.
First, do not forget to import the metrics:
```
from sklearn.metrics import silhouette_score
# Older scikit-learn releases spelled this function calinski_harabaz_score
from sklearn.metrics import calinski_harabasz_score
```
Calculate the Silhouette Coefficient score for all the algorithms, as shown in the following code:
```
kmeans_score = silhouette_score(data_standardized, pred_kmeans, metric='euclidean')
meanshift_score = silhouette_score(data_standardized, pred_meanshift, metric='euclidean')
dbscan_score = silhouette_score(data_standardized, pred_dbscan, metric='euclidean')
print(kmeans_score, meanshift_score, dbscan_score)
```
The scores are approximately 0.355, 0.093, and 0.168 for the k-means, Mean-Shift, and DBSCAN algorithms, respectively.
Finally, calculate the Calinski–Harabasz index for all the algorithms. The following is a snippet of the code:
```
kmeans_score = calinski_harabasz_score(data_standardized, pred_kmeans)
meanshift_score = calinski_harabasz_score(data_standardized, pred_meanshift)
dbscan_score = calinski_harabasz_score(data_standardized, pred_dbscan)
print(kmeans_score, meanshift_score, dbscan_score)
```
The scores are approximately 139.8, 112.9, and 42.45 for the k-means, Mean-Shift, and DBSCAN algorithms, respectively.
Both metrics assign the highest score to the k-means algorithm, so it is possible to conclude that it outperforms the other models on this dataset and hence should be the one selected to solve the data problem.
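To see what the two metrics reward, both can be computed by hand on a handful of one-dimensional values. The sketch below uses only the standard library. The function names silhouette and calinski_harabasz, the toy values, and the two labelings are assumptions made for illustration. For both metrics a higher value indicates better-separated clusters:

```python
from statistics import mean

def silhouette(values, labels):
    """Mean silhouette coefficient, using absolute difference as distance."""
    scores = []
    for i, (v, label) in enumerate(zip(values, labels)):
        # a: mean distance to the other members of the same cluster
        same = [abs(v - w) for j, (w, l) in enumerate(zip(values, labels))
                if l == label and j != i]
        if not same:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # b: smallest mean distance to the members of another cluster
        other = [mean([abs(v - w) for w, l in zip(values, labels) if l == o])
                 for o in set(labels) if o != label]
        a, b = mean(same), min(other)
        scores.append((b - a) / max(a, b))
    return mean(scores)

def calinski_harabasz(values, labels):
    """Ratio of between-cluster to within-cluster dispersion."""
    clusters = sorted(set(labels))
    n, k = len(values), len(clusters)
    overall = mean(values)
    between = within = 0.0
    for c in clusters:
        members = [v for v, l in zip(values, labels) if l == c]
        center = mean(members)
        between += len(members) * (center - overall) ** 2
        within += sum((v - center) ** 2 for v in members)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical values with one sensible and one poor labeling
values = [1.0, 2.0, 9.0, 10.0]
good_labels = [0, 0, 1, 1]
poor_labels = [0, 1, 0, 1]

good_sil, poor_sil = silhouette(values, good_labels), silhouette(values, poor_labels)
good_ch, poor_ch = (calinski_harabasz(values, good_labels),
                    calinski_harabasz(values, poor_labels))
print(round(good_sil, 4), round(poor_sil, 4))
print(good_ch, poor_ch)
```

The sensible labeling scores higher on both metrics, which mirrors the reasoning used to prefer k-means in this activity.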
