After applying a clustering algorithm, it is necessary to evaluate how well the algorithm has performed. This is especially important when it is difficult to visually evaluate the clusters; for example, when there are several features.
Usually, with supervised algorithms, it is easy to evaluate their performance by simply comparing the prediction of each instance with its true value (class). On the other hand, when dealing with unsupervised models (such as clustering algorithms), it is necessary to pursue other strategies.
In the specific case of clustering algorithms, it is possible to evaluate performance by measuring the similarity of the data points that belong to the same cluster.
Scikit-learn allows its users to use three different scores for evaluating the performance of unsupervised clustering algorithms. The main idea behind these scores is to measure how well-defined the cluster's edges are, instead of measuring the dispersion within a cluster. Hence, it is worth mentioning that the scores do not take into account the size of each cluster.
Two of the most commonly used scores for measuring unsupervised clustering tasks are explained as follows:

The Silhouette Coefficient score relates the mean distance between a data point and all other points in its own cluster (a) to the mean distance between that point and the points of its nearest neighboring cluster (b):

s = (b - a) / max(a, b)

The result of the score is a value between -1 and 1. The lower the value, the worse the performance of the algorithm, and values around 0 imply overlapping clusters. It is also important to clarify that this score does not work very well when using density-based algorithms such as DBSCAN.
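To make the formula concrete, here is a minimal sketch (not part of the book's exercise) that computes the coefficient by hand for one point of a tiny, made-up dataset and compares the result against scikit-learn's silhouette_samples() function; the data values are arbitrary choices for illustration:

import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny one-dimensional dataset with two clusters: {1, 2} and {8, 9}
data = np.array([[1.0], [2.0], [8.0], [9.0]])
labels = np.array([0, 0, 1, 1])

# Manual calculation for the first point (value 1.0):
# a = mean distance to the other points of its own cluster -> |1 - 2| = 1.0
# b = mean distance to the points of the nearest other cluster -> (|1 - 8| + |1 - 9|) / 2 = 7.5
a, b = 1.0, 7.5
manual_score = (b - a) / max(a, b)

# scikit-learn computes the same per-point value
sklearn_score = silhouette_samples(data, labels)[0]
print(manual_score, sklearn_score)  # both are approximately 0.867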
The Calinski–Harabasz Index relates the dispersion between clusters to the dispersion within each cluster. The higher the value of the index, the better the definition and separation of the clusters. There is no acceptable cut-off value, so the performance of the algorithms using this index is evaluated through comparison, where the algorithm with the highest value is the one that performs best. As with the Silhouette Coefficient, this score does not perform well on density-based algorithms such as DBSCAN.
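Because the index is only meaningful in comparison, a common way to use it is to score several candidate models on the same data and keep the one with the highest value. The following sketch illustrates this on synthetic data; the make_blobs parameters and the candidate cluster counts are assumptions made purely for the example:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Synthetic data containing three well-separated groups
data, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Score several candidate cluster counts; the highest value wins
for k in (2, 3, 6):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(data)
    print(k, calinski_harabasz_score(data, labels))

# With three true groups, k=3 typically yields the highest index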
Unfortunately, the scikit-learn library does not contain other methods that effectively measure the performance of density-based clustering algorithms. Although the scores mentioned here may work in some cases for these algorithms, when they do not, there is no alternative other than manual evaluation.
However, it is worth mentioning that there are additional performance measures in scikit-learn for cases where a ground truth label is known, known as supervised clustering; for instance, when performing clustering over a set of observations of journalism students who have already signed up for a major or a specialization area. If we were to use their demographic information as well as some student records to categorize them into clusters that represent their choice of major, it would be possible to compare the predicted classification with the actual classification.
Some of these measures are as follows:
The homogeneity score measures whether each cluster contains only data points that belong to a single class label. The homogeneity_score() function is part of scikit-learn's metrics module, and it receives the list of ground truth clusters and the list of predicted clusters as inputs, as follows:

from sklearn.metrics import homogeneity_score
score = homogeneity_score(true_labels, predicted_labels)
The completeness score measures whether all the data points that belong to the same class label are assigned to the same cluster. The completeness_score() function is also part of the metrics module and takes the same inputs:

from sklearn.metrics import completeness_score
score = completeness_score(true_labels, predicted_labels)
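To see how these two measures differ, consider the following small sketch; the label lists are made up purely for illustration and are not taken from the book's dataset:

from sklearn.metrics import homogeneity_score, completeness_score

# Ground truth has two classes; the prediction splits class 0 into two clusters
true_labels = [0, 0, 0, 1, 1, 1]
predicted_labels = [0, 0, 1, 2, 2, 2]

# Homogeneity is 1.0 because every predicted cluster contains a single class
print(homogeneity_score(true_labels, predicted_labels))

# Completeness is lower because class 0 was spread across two clusters
print(completeness_score(true_labels, predicted_labels))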
Note
To explore other measures that evaluate the performance of supervised clustering tasks, visit the following URL, under the clustering section: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
In this exercise, we will learn how to calculate the two scores we discussed in the previous section that are available in scikit-learn. Perform the following steps to complete this exercise:
First, import the functions that calculate both scores from scikit-learn's metrics module:

from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score
The input parameters of the silhouette_score() function are the data, the predicted values of the model (the clusters assigned to each data point), and the distance measure:
Note
The code snippet shown here uses a backslash ( \ ) to split the logic across multiple lines. When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.
kmeans_score = silhouette_score(data, pred_kmeans, \
                                metric='euclidean')
meanshift_score = silhouette_score(data, pred_meanshift, \
                                   metric='euclidean')
dbscan_score = silhouette_score(data, pred_dbscan, \
                                metric='euclidean')
print(kmeans_score, meanshift_score, dbscan_score)
The first three lines call the silhouette_score() function over each of the models (the k-means, mean-shift, and DBSCAN algorithms) by inputting the data, the predictions, and the distance measure. The last line of code prints out the score for each of the models.
The scores come out to approximately 0.359, 0.3705, and 0.1139 for the k-means, mean-shift, and DBSCAN algorithms, respectively.
You can observe that the k-means and mean-shift algorithms have similar scores, while the DBSCAN score is closer to zero. This can indicate that the performance of the first two algorithms is much better and, hence, that the DBSCAN algorithm should not be considered for solving this data problem.
Nevertheless, it is important to remember that this type of score does not perform well when evaluating the DBSCAN algorithm. This is basically because, when one cluster surrounds another, the score can interpret that as overlap when, in reality, the clusters are very well defined, as is the case with the current dataset.
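The following sketch (separate from the exercise) reproduces this effect on synthetic data: two concentric circles form two well-defined clusters that DBSCAN separates correctly, yet the Silhouette Coefficient stays low because the inner ring lies geometrically close to the outer one. The eps and min_samples values are assumptions chosen to work for this particular synthetic dataset:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Two concentric circles: well-defined clusters, but one surrounds the other
data, _ = make_circles(n_samples=500, factor=0.3, noise=0.03, random_state=0)

# DBSCAN recovers the two rings as separate clusters
pred = DBSCAN(eps=0.2, min_samples=5).fit_predict(data)
print("Clusters found:", np.unique(pred))

# The score stays low even though the clusters are well defined, because
# each ring's points are geometrically close to the other ring
print("Silhouette:", silhouette_score(data, pred, metric='euclidean'))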
The input parameters of the calinski_harabasz_score() function are the data and the predicted values of the model (the clusters assigned to each data point):

kmeans_score = calinski_harabasz_score(data, pred_kmeans)
meanshift_score = calinski_harabasz_score(data, pred_meanshift)
dbscan_score = calinski_harabasz_score(data, pred_dbscan)
print(kmeans_score, meanshift_score, dbscan_score)
Again, the first three lines apply the calinski_harabasz_score() function over the three models by passing the data and the prediction as inputs. The last line prints out the results.
The values come out to approximately 1379.7, 1305.14, and 0.0017 for the k-means, mean-shift, and DBSCAN algorithms, respectively. Once again, the results are similar to the ones we obtained using the Silhouette Coefficient score, where both the k-means and mean-shift algorithms performed similarly well, while the DBSCAN algorithm did not.
Moreover, it is worth mentioning that the scale of each method (the Silhouette Coefficient score and the Calinski-Harabasz index) differs significantly, so they are not easily comparable.
Note
To access the source code for this specific section, please refer to https://packt.live/3e3YIif.
You can also run this example online at https://packt.live/2MXOQdZ. You must execute the entire Notebook in order to get the desired result.
You have successfully measured the performance of three different clustering algorithms.
In conclusion, the scores presented in this topic are a way of evaluating the performance of clustering algorithms. However, it is important to consider that the results of these scores are not definitive, since how reliably they work varies from algorithm to algorithm.
You might find yourself in a situation in which you are not sure about the performance of the algorithms because it cannot be evaluated graphically. Therefore, you will have to measure the performance of the algorithms using numerical metrics that allow comparisons. For the previously trained models, calculate the Silhouette Coefficient score and the Calinski-Harabasz index to measure the performance of the algorithms. The sketch that follows provides hints regarding how you can do this:
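Assuming the cluster assignments produced by the previously trained models are stored in pred_kmeans, pred_meanshift, and pred_dbscan, and the feature matrix in data (hypothetical names; adjust them to match your own notebook), the calculation mirrors the preceding exercise:

from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Silhouette Coefficient for each of the previously trained models
kmeans_sil = silhouette_score(data, pred_kmeans, metric='euclidean')
meanshift_sil = silhouette_score(data, pred_meanshift, metric='euclidean')
dbscan_sil = silhouette_score(data, pred_dbscan, metric='euclidean')
print(kmeans_sil, meanshift_sil, dbscan_sil)

# Calinski-Harabasz Index for the same models
kmeans_ch = calinski_harabasz_score(data, pred_kmeans)
meanshift_ch = calinski_harabasz_score(data, pred_meanshift)
dbscan_ch = calinski_harabasz_score(data, pred_dbscan)
print(kmeans_ch, meanshift_ch, dbscan_ch)

For each metric, compare the three values; the model with the highest score is the best candidate according to that metric.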
The results may differ based on the choices you made during the development of the previous activities and how you initialized certain parameters in each algorithm. Nevertheless, the following results can be expected for a k-means algorithm set to divide the dataset into six clusters, a mean-shift algorithm with a bandwidth equal to 0.4, and a DBSCAN algorithm with an epsilon value of 0.8:
Silhouette Coefficient:
K-means = 0.3515
mean-shift = 0.0933
DBSCAN = 0.1685

Calinski-Harabasz Index:
K-means = 145.73
mean-shift = 112.90
DBSCAN = 42.45
Note
The solution for this activity can be found via this link.