In this example, we use R's cluster analysis functions to find clusters in the wheat dataset from http://www.ics.uci.edu/.
The R script we want to use in Jupyter is the following:
# load the wheat data set from uci.edu
wheat <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", sep="\t")
# define useful column names
colnames(wheat) <- c("area", "perimeter", "compactness", "length", "width", "asymmetry", "groove", "undefined")
# exclude incomplete cases from the data
wheat <- wheat[complete.cases(wheat),]
# calculate the clusters
fit <- kmeans(wheat, 5)
fit
Once entered into a notebook, we have something like this:
The resulting cluster information is a K-means clustering with five clusters of sizes 29, 57, 65, 15, and 32. (Note that, because I did not set a seed value for the random number generator, your results may vary.)
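If you want the same clusters every time the cell runs, one option is to call set.seed() before kmeans(). The following is a minimal sketch, not part of the original script: the seed value 42 is arbitrary, and the numeric columns of R's built-in iris data set are used as a stand-in (an assumption) so the snippet runs without downloading the wheat file.

```r
# Stand-in data (assumption): numeric columns of the built-in iris data set,
# used only so this snippet runs without downloading the wheat file.
data <- iris[, 1:4]

# Fix the random number generator state before each call so that
# kmeans() picks the same initial centers both times.
set.seed(42)   # 42 is an arbitrary choice
fit1 <- kmeans(data, 5)

set.seed(42)
fit2 <- kmeans(data, 5)

# With the same seed, both runs report the same cluster sizes
identical(fit1$size, fit2$size)

# Individual parts of the result can be read from the fit object
fit1$size      # number of rows in each cluster
fit1$centers   # cluster means, one row per cluster
```

In the wheat script, the same idea would mean placing a set.seed() call on the line just before fit <- kmeans(wheat, 5).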
Cluster means are:
area perimeter compactness length width asymmetry ...