Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Book Overview & Buying The Machine Learning Workshop
  • Table Of Contents Toc
The Machine Learning Workshop

The Machine Learning Workshop - Second Edition

By : Hyatt Saleh
4.3 (6)
close
close
The Machine Learning Workshop

The Machine Learning Workshop

4.3 (6)
By: Hyatt Saleh

Overview of this book

Machine learning algorithms are an integral part of almost all modern applications. To make the learning process faster and more accurate, you need a tool flexible and powerful enough to help you build machine learning algorithms quickly and easily. With The Machine Learning Workshop, you'll master the scikit-learn library and become proficient in developing clever machine learning algorithms. The Machine Learning Workshop begins by demonstrating how unsupervised and supervised learning algorithms work by analyzing a real-world dataset of wholesale customers. Once you've got to grips with the basics, you'll develop an artificial neural network using scikit-learn and then improve its performance by fine-tuning hyperparameters. Towards the end of the workshop, you'll study the dataset of a bank's marketing activities and build machine learning models that can list clients who are likely to subscribe to a term deposit. You'll also learn how to compare these models and select the optimal one. By the end of The Machine Learning Workshop, you'll not only have learned the difference between supervised and unsupervised models and their applications in the real world, but you'll also have developed the skills required to get started with programming your very own machine learning algorithms.
Table of Contents (8 chapters)
close
close
Preface

Mean-Shift Algorithm

The mean-shift algorithm works by assigning each data point a cluster based on the density of the data points in the data space, also known as the mode in a distribution function. Contrary to the k-means algorithm, the mean-shift algorithm does not require you to specify the number of clusters as a parameter.

The algorithm works by modeling the data points as a distribution function, where high-density areas (high concentration of data points) represent high peaks. Then, the general idea is to shift each data point until it reaches its nearest peak, which becomes a cluster.

Understanding the Algorithm

The first step of the mean-shift algorithm is to represent the data points as a density distribution. To do so, the algorithm builds upon the idea of Kernel Density Estimation (KDE), which is a method that's used to estimate the distribution of a set of data:

Figure 2.9: An image depicting the idea behind Kernel Density Estimation

Figure 2.9: An image depicting the idea behind Kernel Density Estimation

In the preceding diagram, the dots at the bottom of the shape represent the data points that the user inputs, while the cone-shaped lines represent the estimated distribution of the data points. The peaks (high-density areas) will be the clusters. The process of assigning data points to each cluster is as follows:

  1. A window of a specified size (bandwidth) is drawn around each data point.
  2. The mean of the data inside the window is computed.
  3. The center of the window is shifted to the mean.

Steps 2 and 3 are repeated until the data point reaches a peak, which will determine the cluster that it belongs to.

The bandwidth value should be coherent with the distribution of the data points in the dataset. For example, for a dataset normalized between 0 and 1, the bandwidth value should be within that range, while for a dataset with all values between 1,000 and 2,000, it would make more sense to have a bandwidth between 100 and 500.

In the following diagram, the estimated distribution is represented by the lines, while the data points are the dots. In each of the boxes, the data points shift to the nearest peak. All the data points in a certain peak belong to that cluster:

Figure 2.10: A sequence of images illustrating the working of the mean-shift algorithm

Figure 2.10: A sequence of images illustrating the working of the mean-shift algorithm

The number of shifts that a data point has to make to reach a peak depends on its bandwidth (the size of the window) and its distance from the peak.

Note

To explore all the parameters of the mean-shift algorithm in scikit-learn, visit http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html.

Exercise 2.03: Importing and Training the Mean-Shift Algorithm over a Dataset

The following exercise will be performed using the same dataset that we loaded in Exercise 2.01, Plotting a Histogram of One Feature from the Circles Dataset. Considering this, use the same Jupyter Notebook that you used to develop the previous exercises. Perform the following steps to complete this exercise:

  1. Open the Jupyter Notebook that you used for the previous exercise.
  2. Import the k-means algorithm class from scikit-learn as follows:
    from sklearn.cluster import MeanShift
  3. Train the model with a bandwidth of 0.5:
    est_meanshift = MeanShift(0.5)
    est_meanshift.fit(data)
    pred_meanshift = est_meanshift.predict(data)

    First, the model is instantiated with a bandwidth of 0.5. Next, the model is fit to the data. Finally, the model is used to assign a cluster to each data point.

    Considering that the dataset contains values ranging from −1 to 1, the bandwidth value should not be above 1. The value of 0.5 was chosen after trying out other values, such as 0.1 and 0.9.

    Note

    Take into account the fact that the bandwidth is a parameter of the algorithm and that, as a parameter, it can be fine-tuned to arrive at the best performance. This fine-tuning process will be covered in Chapter 3, Supervised Learning – Key Steps.

  4. Plot the results from clustering the data points into clusters:
    plt.scatter(data.iloc[:,0], data.iloc[:,1], c=pred_meanshift)
    plt.show()

    The output is as follows:

    Figure 2.11: The plot obtained using the preceding code

Figure 2.11: The plot obtained using the preceding code

Again, as the dataset only contains two features, both are passed as inputs to the scatter function, which become the values of the axes. Also, the labels that were obtained from the clustering process are used as the colors to display the data points.

The total number of clusters that have been created is four.

Note

To access the source code for this exercise, please refer to https://packt.live/37vBOOk.

You can also run this example online at https://packt.live/3e6uqM2. You must execute the entire Notebook in order to get the desired result.

You have successfully imported and trained the mean-shift algorithm.

In conclusion, the mean-shift algorithm starts by drawing the distribution function that represents the set of data points. This process consists of creating peaks in high-density areas, while leaving the areas with a low density flat.

Following this, the algorithm proceeds to classify the data points into clusters by shifting each point slowly and iteratively until it reaches a peak, which becomes its cluster.

Activity 2.03: Applying the Mean-Shift Algorithm to a Dataset

In this activity, you will apply the mean-shift algorithm to the dataset to see which algorithm fits the data better. Therefore, using the previously loaded Wholesale Consumers dataset, apply the mean-shift algorithm to the data and classify the data into clusters. Perform the following steps to complete this activity:

  1. Open the Jupyter Notebook that you used for the previous activity.

    Note

    Considering that you are using the same Jupyter Notebook, be careful not to overwrite any previous variables.

  2. Train the model and assign a cluster to each data point in your dataset. Plot the results.

    The visualization of clusters will differ based on the bandwidth and the features that have been chosen to be plotted.

    Note

    The solution for this activity can be found via this link.

CONTINUE READING
83
Tech Concepts
36
Programming languages
73
Tech Tools
Icon Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.
Icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Icon 50+ new titles added per month and exclusive early access to books as they are being written.
The Machine Learning Workshop
notes
bookmark Notes and Bookmarks search Search in title playlist Add to playlist download Download options font-size Font size

Change the font size

margin-width Margin width

Change margin width

day-mode Day/Sepia/Night Modes

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Confirmation

Modal Close icon
claim successful

Buy this book with your credits?

Modal Close icon
Are you sure you want to buy this book with one of your credits?
Close
YES, BUY

Submit Your Feedback

Modal Close icon
Modal Close icon
Modal Close icon