Labeled data is often hard to come by, and sometimes you simply want to uncover the underlying patterns in a dataset. In this recipe, we will learn how to build the popular k-means clustering model in Spark.
To execute this recipe, you need a working Spark environment. You should have already gone through the Standardizing the data recipe, where we standardized the encoded census data.
No other prerequisites are required.
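For reference, the train(...) call in this recipe assumes that final_data is an RDD of two-element records whose second element is the feature vector. A minimal, hypothetical stand-in (the values below are made up purely for illustration) might look like this:

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors

sc = SparkContext.getOrCreate()

# Hypothetical stand-in for final_data: (label, features) pairs,
# mirroring the shape the train(...) call below expects.
final_data = sc.parallelize([
    (0, Vectors.dense([0.5, -1.2, 0.3])),
    (1, Vectors.dense([1.1, 0.4, -0.7])),
    (0, Vectors.dense([-0.3, 0.9, 1.5])),
])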
Just like classification or regression models, clustering models are straightforward to build in Spark. Here's the code that searches for patterns in the census data:
import pyspark.mllib.clustering as clu

# Train k-means with k=2 on the feature vectors (row[1]) only,
# using random initialization and a fixed seed for reproducibility.
model = clu.KMeans.train(
    final_data.map(lambda row: row[1]),
    2,
    initializationMode='random',
    seed=666
)
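Once trained, the resulting KMeansModel can be inspected and evaluated: computeCost(...) returns the within-set sum of squared errors (WSSSE), clusterCenters exposes the centroids, and predict(...) assigns each point to a cluster. A brief sketch, reusing the same feature RDD:

# Evaluate the trained model on the feature vectors.
features = final_data.map(lambda row: row[1])

# Within-set sum of squared errors; lower means tighter clusters.
# Comparing WSSSE across several values of k helps choose the
# cluster count.
wssse = model.computeCost(features)
print('WSSSE: {0:.2f}'.format(wssse))

# Centroids and per-point cluster assignments.
print(model.clusterCenters)
assignments = model.predict(features).collect()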