In Python Data Analysis, you learned about clustering (separating data into clusters without providing any hints), which is a form of unsupervised learning. Sometimes we need to guess the number of clusters, as we did in the Clustering streaming data with Spark recipe.
There is no restriction against having clusters contain other clusters. In such a case, we speak of hierarchical clustering. We need a distance metric to separate data points. Take a look at the following equations:
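In standard notation, which is an assumption here (the symbols a, b, A, B, and n are illustrative and not defined elsewhere in this recipe), the Euclidean distance between two n-dimensional points a and b, and the single-linkage distance between two sets of points A and B, can be written as:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \tag{9.2}$$

$$D(A, B) = \min \{\, d(a, b) : a \in A,\ b \in B \,\} \tag{9.3}$$

In words, single linkage measures the distance between two clusters as the distance between their two closest members.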
In this recipe, we will use Euclidean distance (9.2), provided by the SciPy pdist() function. The distance between sets of points is given by the linkage criterion. In this recipe, we will use the single-linkage criterion (9.3), provided by the SciPy linkage() function.
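As a minimal sketch of how these two functions fit together, the following example computes pairwise Euclidean distances with pdist() and passes them to linkage() with the single-linkage method. The toy data and the variable names (points, distances, Z, labels) are hypothetical and chosen only for illustration, and the final fcluster() call is just one possible way to inspect the result, not necessarily what the recipe does.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two well-separated groups of 2-D points.
points = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [10.0, 10.0],
    [10.0, 11.0],
    [11.0, 10.0],
])

# Condensed vector of pairwise Euclidean distances: n * (n - 1) / 2 entries.
distances = pdist(points, metric="euclidean")

# Agglomerative clustering with the single-linkage criterion.
# Each row of Z records one merge: the two cluster indices, the distance
# at which they were merged, and the size of the new cluster.
Z = linkage(distances, method="single")

# Cut the hierarchy into (at most) two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")

print(Z)
print(labels)
```

The last row of Z describes the final merge. With single linkage, its distance is that of the closest pair of points drawn from the two groups.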