Machine Learning Fundamentals

By: Hyatt Saleh

Overview of this book

As machine learning algorithms become popular, new tools that optimize these algorithms are also developed. Machine Learning Fundamentals explains how to use the syntax of scikit-learn. You'll study the difference between supervised and unsupervised models, as well as the importance of choosing the appropriate algorithm for each dataset. You'll apply unsupervised clustering algorithms to real-world datasets to discover patterns and profiles, and explore the process of solving an unsupervised machine learning problem. The focus of the book then shifts to supervised learning algorithms. You'll learn to implement different supervised algorithms and develop neural network structures using the scikit-learn package. You'll also learn how to perform coherent result analysis to improve the performance of the algorithm by tuning hyperparameters. By the end of this book, you will have gained all the skills required to start programming machine learning algorithms.

Chapter 2: Unsupervised Learning: Real-life Applications


Activity 3: Using Data Visualization to Aid the Preprocessing Process

  1. Load the previously downloaded dataset by using the Pandas function read_csv(). Store the dataset in a Pandas DataFrame named data:

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    np.random.seed(0)

    First, import the required libraries. Then, pass the path of the dataset to Pandas' read_csv() function:

    data = pd.read_csv("datasets/wholesale_customers_data.csv")
  2. Check for missing values in your DataFrame. By chaining the isnull() and sum() functions, count the missing values of the entire dataset at once:

    data.isnull().sum()

    Figure 2.16: A screenshot showing the number of missing values in the DataFrame

    As you can see from the preceding screenshot, there are no missing values in the dataset.

  3. Check for outliers in your DataFrame. Using the technique you learned in the previous chapter, label those values that fall outside of three standard deviations from the mean as outliers. The following code snippet allows you to look for outliers in the entire set of features at once. However, another valid method would be to check for outliers one feature at a time:

    outliers = {}
    for i in range(data.shape[1]):
      # Lower and upper thresholds: three standard deviations from the mean
      min_t = data[data.columns[i]].mean() - (3 * data[data.columns[i]].std())
      max_t = data[data.columns[i]].mean() + (3 * data[data.columns[i]].std())
      # Count the values of the current feature that fall outside the thresholds
      count = 0
      for j in data[data.columns[i]]:
        if j < min_t or j > max_t:
          count += 1
      outliers[data.columns[i]] = [count,data.shape[0]-count]
    print(outliers)

    The count of outliers for each of the features is shown in the following figure:

    Figure 2.17: A screenshot showing the output of the preceding code snippet

    As you can see from the preceding screenshot, some features do have outliers. Considering that there are only a few outliers for each feature, there are two possible ways to handle them.

    First, you could decide to delete the outliers (a minimal sketch of such filtering is shown at the end of this activity). This decision can be supported by displaying a histogram of the features with outliers:

    plt.hist(data["Fresh"])
    plt.show()

    Figure 2.18: An example histogram plot for the "Fresh" feature

    For instance, for the feature named Fresh, it can be seen through the histogram that most instances are represented by values below 40,000. Hence, deleting the instances above that value will not affect the performance of the model.

    Alternatively, the second approach would be to leave the outliers as they are, considering that they do not represent a large portion of the dataset. This decision can be supported by visualizing the participation of the outliers with a pie chart. See the code and the output that follow:

    plt.figure(figsize=(8,8))
    plt.pie(outliers["Detergents_Paper"],autopct="%.2f")
    plt.show()

    Figure 2.19: A pie chart showing the participation of outliers from the Detergents_Paper feature in the dataset

    The preceding diagram shows the participation of the outliers from the Detergents_Paper feature, which was the feature with the most outliers in the dataset. Only 2.27% of the values are outliers, a proportion so low that it will not affect the performance of the model either.

  4. Rescale the data. For this solution, the formula for standardization has been used. Note that the formula can be applied to the entire dataset at once, instead of being applied individually to each feature:

    data_standardized = (data - data.mean())/data.std()
    data_standardized.head()

    Figure 2.20: A table showing the first five instances of the standardized dataset
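
    As mentioned in step 3, one way of handling the outliers would be to delete them. The following is a minimal sketch of how such filtering could be done with pandas, keeping only the instances that fall within three standard deviations of the mean for every feature; the data_filtered DataFrame is an assumption of this sketch and is not used in the remaining activities:

    # Thresholds per feature: three standard deviations from the mean
    lower = data.mean() - 3 * data.std()
    upper = data.mean() + 3 * data.std()
    # Keep only the rows that fall within the thresholds for every feature
    mask = ((data >= lower) & (data <= upper)).all(axis=1)
    data_filtered = data[mask]
    print(data.shape, data_filtered.shape)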

Activity 4: Applying the k-means Algorithm to a Dataset

  1. Open the Jupyter Notebook that you used for the previous activity. There, you should have imported all the required libraries and stored the dataset in a variable named data. The standardized data should look as follows:

    data_standardized = (data - data.mean())/data.std()
    data_standardized.head()

    Figure 2.21: A screenshot displaying the first five instances of the standardized dataset

  2. Calculate the average distance of the data points from their centroids in relation to the number of clusters. Based on this distance, select the appropriate number of clusters with which to train the model.

    First, import the algorithm class:

    from sklearn.cluster import KMeans

    Next, using the code in the following snippet, calculate the average distance of the data points from their centroids for each number of clusters created:

    ideal_k = []
    for i in range(1,21):
      # Train a k-means model with i clusters and store its inertia_
      # (the within-cluster sum of squared distances)
      est_kmeans = KMeans(n_clusters=i)
      est_kmeans.fit(data_standardized)
      ideal_k.append([i,est_kmeans.inertia_])
    ideal_k = np.array(ideal_k)

    Finally, plot the relation to find the breaking point of the line, and select the number of clusters:

    plt.plot(ideal_k[:,0],ideal_k[:,1])
    plt.show()

    Figure 2.22: The output of the plot function used

  3. Train the model and assign a cluster to each data point in your dataset. Plot the results.

    To train the model, use the following code:

    est_kmeans = KMeans(n_clusters=6)
    est_kmeans.fit(data_standardized)
    pred_kmeans = est_kmeans.predict(data_standardized)

    The number of clusters selected is 6; however, since there is no exact breaking point, values between 5 and 10 are also acceptable (a sketch comparing these candidate values is shown at the end of this activity).

    Finally, plot the results of the clustering process. As the dataset contains eight different features, choose two features to draw at once, as shown in the following code:

    plt.subplots(1, 2, sharex='col', sharey='row', figsize=(16,8))
    plt.scatter(data.iloc[:,5], data.iloc[:,3], c=pred_kmeans, s=20)
    plt.xlim([0, 20000])
    plt.ylim([0,20000])
    plt.xlabel('Frozen')
    plt.subplot(1, 2, 1)
    plt.scatter(data.iloc[:,4], data.iloc[:,3], c=pred_kmeans, s=20)
    plt.xlim([0, 20000])
    plt.ylim([0,20000])
    plt.xlabel('Grocery')
    plt.ylabel('Milk')
    plt.show()

    Figure 2.23: Two example plots obtained after the clustering process

    The subplots() function from Matplotlib has been used to plot two scatter graphs at a time.

    As can be seen from the plots, there is no obvious visual relation because only two of the eight features present in the dataset can be drawn at a time. However, the final output of the model creates six different clusters that represent six different profiles of clients.
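
    As noted in step 3, since there is no exact breaking point in the elbow plot, values of k between 5 and 10 are also acceptable. The following is a minimal sketch that supports the choice by comparing the Silhouette Coefficient (covered later in this chapter) for those candidate values; the loop and variable names are assumptions of this sketch:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Compare candidate numbers of clusters; higher scores indicate
    # better-separated clusters
    for k in range(5, 11):
      labels = KMeans(n_clusters=k).fit_predict(data_standardized)
      print(k, silhouette_score(data_standardized, labels, metric='euclidean'))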

Activity 5: Applying the Mean-Shift Algorithm to a Dataset

  1. Open the Jupyter Notebook that you used for the previous activity.

  2. Train the model and assign a cluster to each data point in your dataset. Plot the results.

    First, do not forget to import the algorithm class:

    from sklearn.cluster import MeanShift

    To train the model, use the following code:

    est_meanshift = MeanShift(bandwidth=0.4)
    est_meanshift.fit(data_standardized)
    pred_meanshift = est_meanshift.predict(data_standardized)

    The model was trained using a bandwidth of 0.4. However, feel free to test other values to see how the result changes (a sketch that estimates the bandwidth from the data is shown at the end of this activity).

    Finally, plot the results of the clustering process. As the dataset contains eight different features, choose two features to draw at a time, as shown in the snippet below. Similar to the previous activity, the separation between clusters is not visually evident because only two of the eight features can be drawn at once:

    plt.subplots(1, 2, sharex='col', sharey='row', figsize=(16,8))
    plt.scatter(data.iloc[:,5], data.iloc[:,3], c=pred_meanshift, s=20)
    plt.xlim([0, 20000])
    plt.ylim([0,20000])
    plt.xlabel('Frozen')
    plt.subplot(1, 2, 1)
    plt.scatter(data.iloc[:,4], data.iloc[:,3], c=pred_meanshift, s=20)
    plt.xlim([0, 20000])
    plt.ylim([0,20000])
    plt.xlabel('Grocery')
    plt.ylabel('Milk')
    plt.show()

    Figure 2.24: Example plots obtained at the end of the process
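
    Instead of hardcoding the bandwidth, you could let scikit-learn derive one from the data by using the estimate_bandwidth() function. The following is a minimal sketch of this alternative; the quantile value of 0.1 and the variable names are assumptions of this sketch:

    from sklearn.cluster import MeanShift, estimate_bandwidth

    # Estimate a bandwidth from the standardized data
    bandwidth = estimate_bandwidth(data_standardized, quantile=0.1)
    est_meanshift = MeanShift(bandwidth=bandwidth)
    pred_meanshift_est = est_meanshift.fit_predict(data_standardized)
    # Print the estimated bandwidth and the number of clusters found
    print(bandwidth, len(np.unique(pred_meanshift_est)))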

Activity 6: Applying the DBSCAN Algorithm to the Dataset

  1. Open the Jupyter Notebook that you used for the previous activity.

  2. Train the model and assign a cluster to each data point in your dataset. Plot the results.

    First, do not forget to import the algorithm class:

    from sklearn.cluster import DBSCAN

    To train the model, use the following code:

    est_dbscan = DBSCAN(eps=0.8)
    pred_dbscan = est_dbscan.fit_predict(data_standardized)

    The model was trained using an epsilon value of 0.8. However, feel free to test other values to see how the results change (a sketch that compares a few candidate epsilon values is shown at the end of this activity).

    Finally, plot the results of the clustering process. As the dataset contains eight different features, choose two features to draw at once, as shown in the following code:

    plt.subplots(1, 2, sharex='col', sharey='row', figsize=(16,8))
    plt.scatter(data.iloc[:,5], data.iloc[:,3], c=pred_dbscan, s=20)
    plt.xlim([0, 20000])
    plt.ylim([0,20000])
    plt.xlabel('Frozen')
    plt.subplot(1, 2, 1)
    plt.scatter(data.iloc[:,4], data.iloc[:,3], c=pred_dbscan, s=20)
    plt.xlim([0, 20000])
    plt.ylim([0,20000])
    plt.xlabel('Grocery')
    plt.ylabel('Milk')
    plt.show()

    Figure 2.25: Example plots obtained at the end of the clustering process

    Similar to the previous activity, the separation between clusters is not visually evident because only two of the eight features can be drawn at once.
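
    Because DBSCAN labels noise points with -1, counting the distinct labels is a quick way to check how the epsilon value affects the number of clusters and noise points. The following is a minimal sketch of such a check; the candidate epsilon values are assumptions of this sketch:

    from sklearn.cluster import DBSCAN

    # Count clusters (excluding noise) and noise points for each candidate epsilon
    for eps in [0.6, 0.8, 1.0, 1.2]:
      labels = DBSCAN(eps=eps).fit_predict(data_standardized)
      n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
      n_noise = list(labels).count(-1)
      print(eps, n_clusters, n_noise)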

Activity 7: Measuring and Comparing the Performance of the Algorithms

  1. Open the Jupyter Notebook that you used for the previous activity.

  2. Calculate both the Silhouette Coefficient score and the Calinski–Harabasz index for all the models that you trained previously.

    First, do not forget to import the metrics:

    from sklearn.metrics import silhouette_score
    from sklearn.metrics import calinski_harabasz_score

    Calculate the Silhouette Coefficient score for all the algorithms, as shown in the following code:

    kmeans_score = silhouette_score(data_standardized, pred_kmeans, metric='euclidean')
    meanshift_score = silhouette_score(data_standardized, pred_meanshift, metric='euclidean')
    dbscan_score = silhouette_score(data_standardized, pred_dbscan, metric='euclidean')
    print(kmeans_score, meanshift_score, dbscan_score)

    The scores are approximately 0.355, 0.093, and 0.168 for the k-means, Mean-Shift, and DBSCAN algorithms, respectively.

    Finally, calculate the Calinski–Harabasz index for all the algorithms. The following is a snippet of the code:

    kmeans_score = calinski_harabasz_score(data_standardized, pred_kmeans)
    meanshift_score = calinski_harabasz_score(data_standardized, pred_meanshift)
    dbscan_score = calinski_harabasz_score(data_standardized, pred_dbscan)
    print(kmeans_score, meanshift_score, dbscan_score)

    The scores are approximately 139.8, 112.9, and 42.45 for the k-means, Mean-Shift, and DBSCAN algorithms, respectively.

    By quickly looking at the results obtained for both metrics, it is possible to conclude that the k-means algorithm outperforms the other models, and hence, should be the one selected to solve the data problem.
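
    As a compact way of reproducing the comparison above, both metrics can also be computed in a single loop over the three sets of predictions. The following is a minimal sketch; the predictions dictionary is an assumption of this sketch:

    from sklearn.metrics import silhouette_score, calinski_harabasz_score

    # Compute both metrics for each model's predictions in one pass
    predictions = {"k-means": pred_kmeans, "Mean-Shift": pred_meanshift, "DBSCAN": pred_dbscan}
    for name, pred in predictions.items():
      sil = silhouette_score(data_standardized, pred, metric='euclidean')
      ch = calinski_harabasz_score(data_standardized, pred)
      print(name, round(sil, 3), round(ch, 2))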