An improved algorithm was proposed in 2007. K-means++ addresses the problem of suboptimal clustering by introducing an additional step for a good centroids initialization.
An improved algorithm of initial centers selection looks like this:
- Select randomly any data point to be the first center
- For all other data points, calculate the distance to the first center d(x)
- Sample the next center from the weighted probability distribution, where the probability of each data point to become a next center is proportional to the square of distance d(x)2
- Until k centers are chosen, repeat step 2 and step 3
- Proceed with the standard k-means algorithm
In Swift, it looks like this:
internal mutating func chooseCentroids() { let n = data.count var minDistances = [Double](repeating: Double.infinity, count: n) var centerIndices = [Int]()
clusterID
is an integer identifier of a cluster: the first cluster has identifier zero, the second has one, and so on:
for clusterID in 0 ..< k { var pointIndex...