In the previous recipe, Setting up Spark, we covered a basic setup of Spark. If you followed the Using HDFS recipe, you can optionally serve the data from Hadoop. In this case, you need to specify the URL of the file in this manner: hdfs://hdfs-host:port/path/direct_marketing.csv.
We will use the same data as we did in the Implementing a star schema with fact and dimension tables recipe. However, this time we will use the spend, history, and recency columns. The first column corresponds to recent purchase amounts after a direct marketing campaign, the second to historical purchase amounts, and the third column to the recency of purchase in months. The data is described in http://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html (retrieved September 2015). We will apply the popular K-means machine-learning algorithm to cluster the data. Chapter 9, Ensemble Learning and Dimensionality Reduction, pays more attention to machine learning algorithms...
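To make the clustering step concrete, the following is a minimal plain-Python sketch of what K-means does with the three columns. It is not the Spark code of this recipe: Spark's MLlib ships its own distributed K-means, which is what you would run on a cluster. The helper names (parse_features, kmeans), the toy rows, and the assumption that the CSV has a header row with spend, history, and recency fields are all illustrative.

```python
import csv
import io
import random


def parse_features(csv_text, columns=("spend", "history", "recency")):
    """Read CSV text with a header row; return one float vector per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [[float(row[c]) for c in columns] for row in reader]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point, centers):
    """Index of the center nearest to point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: assign each point to its nearest center,
    then move every center to the mean of its assigned points."""
    rng = random.Random(seed)
    # Start from k distinct rows picked at random.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[closest(p, centers)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


# Toy rows invented for illustration; they are NOT taken from the real data set.
TOY_CSV = """recency,history,spend
1,10,1
3,12,3
9,210,101
11,212,103
"""

points = parse_features(TOY_CSV)
centers = sorted(kmeans(points, k=2))
```

Each returned center is a (spend, history, recency) triple, so a cluster can be read as a customer profile, for example low spenders with a short purchase history. In practice the three columns have very different scales, so scaling them before clustering is usually worth considering.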