In this chapter, we learned that clustering data can speed up the classification of new features: a new feature is classified as belonging to the class that is represented in the cluster it falls into. An appropriate number of clusters can be determined through cross-validation, by choosing the number that results in the most accurate classification.
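As an illustration of classifying through clusters, here is a minimal sketch in Python. It assumes that the clusters have already been formed, that a new feature falls into the cluster with the nearest centroid, and that the class "represented" in a cluster is its majority class. The function names and the toy data are hypothetical and serve only this example.

```python
from collections import Counter
from math import dist


def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def classify_by_cluster(new_point, clusters):
    """Return the majority class of the cluster whose centroid is nearest.

    `clusters` is a list of clusters; each cluster is a list of
    (point, label) pairs. Majority voting is an assumption of this sketch.
    """
    nearest = min(
        clusters,
        key=lambda cluster: dist(new_point, centroid([p for p, _ in cluster])),
    )
    labels = Counter(label for _, label in nearest)
    return labels.most_common(1)[0][0]


# Hypothetical toy data: two already-formed clusters of labeled 2-D points.
clusters = [
    [((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "blue")],
    [((8.0, 8.0), "blue"), ((9.0, 9.5), "blue"), ((8.5, 9.0), "blue")],
]

print(classify_by_cluster((1.2, 1.4), clusters))
print(classify_by_cluster((9.0, 8.0), clusters))
```

The new feature is compared only against one centroid per cluster rather than against every stored feature, which is where the speed-up comes from.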
Clustering groups data according to its similarity. The more clusters there are, the greater the similarity between the features in a cluster, but the fewer features each cluster contains.
We also learned that the k-means algorithm is a clustering algorithm that tries to cluster features in such a way that the mutual distance of the features in a cluster is minimized. To do this, the algorithm computes the centroid of each cluster and assigns each feature to the cluster whose centroid is closest to it. The algorithm finishes the computation of the clusters as soon as they or their centroids no longer...
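The k-means procedure summarized above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the initial centroids are supplied by the caller, the stopping rule (halt once cluster memberships are identical in two consecutive iterations) is an assumption of this sketch, and `k_means`, `centroid`, and the toy points are hypothetical names and data.

```python
from math import dist


def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centroids, max_iterations=100):
    """Cluster `points` around len(initial_centroids) centroids.

    Returns (centroids, assignment), where assignment[j] is the index of
    the cluster that points[j] belongs to.
    """
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        # Assumed stopping rule: memberships did not change since the last pass.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: recompute each centroid from its current members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old centroid if a cluster became empty
                centroids[i] = centroid(members)
    return centroids, assignment


# Hypothetical toy data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assignment = k_means(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(assignment)
print(centroids)
```

The result depends on the initial centroids, so different starting points can lead to different clusters for the same data.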