In clustering, as in classification, we want to find the rule that assigns each observation to the correct class. Unlike classification, however, we also have to discover a plausible subdivision into classes in the first place.
In classification, we get some help from the target (the class labels provided in the training set). In clustering, we cannot rely on any such additional information, so we have to deduce the classes by studying the spatial distribution of the data.
The areas where the data is densest correspond to groups of similar observations. If we can identify observations that are similar to each other and at the same time different from those of another cluster, we can assume that these two clusters correspond to different conditions. At this point, there are two things we need to examine more closely:
- How to measure similarity
- How to define a grouping
A measure of distance (that is, of similarity) and a rule for defining a group are the two ingredients that characterize a clustering algorithm.
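To make these two ingredients concrete, here is a minimal sketch in Python. It assumes Euclidean distance as the similarity measure and a simple distance threshold as the grouping rule. Both are illustrative choices rather than the only options, and the names `euclidean` and `group_by_threshold`, the sample points, and the threshold value are hypothetical.

```python
import math


def euclidean(a, b):
    """Ingredient 1: measure similarity as the Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_by_threshold(points, threshold):
    """Ingredient 2: define a group as all points linked, directly or through
    a chain of neighbours, by distances smaller than `threshold`.
    Returns one integer cluster label per point."""
    labels = [None] * len(points)
    current = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster
        labels[i] = current
        stack = [i]
        while stack:
            j = stack.pop()
            for k in range(len(points)):
                if labels[k] is None and euclidean(points[j], points[k]) < threshold:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels


# Illustrative data: two dense areas, one near (1, 1) and one near (5, 5).
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8)]
labels = group_by_threshold(points, threshold=1.0)
print(labels)  # → [0, 0, 0, 1, 1]
```

Changing either ingredient, for example using a different distance or a different rule for what counts as a group, generally leads to a different clustering of the same data.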