While running a Spark shell connected to an existing cluster, you should see a line specifying the app ID, such as "Connected to Spark cluster with app ID app-20130330015119-0001." The app ID will match the application entry shown under running applications in the cluster's Web UI (by default, the standalone master's UI is viewable on port 8080, while each running application also serves its own UI on port 4040). Start by downloading a dataset to use for some experimentation. There are a number of datasets put together for The Elements of Statistical Learning, which are in a very convenient form to use. Grab the spam dataset using the following command:
wget http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data
Alternatively, you can find the spam dataset from the GitHub link at https://github.com/xsankar/fdps-vii.
Now, load it as a text file into Spark with the following command inside your Spark shell:
scala> val inFile = sc.textFile("./spam.data")
This loads the spam.data file into Spark, with each line of the file becoming a separate entry in the resulting RDD.
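Each line of spam.data is a whitespace-separated row of numeric features, so a typical next step is to map each entry of the RDD through a parsing function that splits the line and converts the fields to doubles. The sketch below shows that parsing logic on a plain String so it runs without a cluster; the sample line and the ParseLine object are illustrative, not part of the dataset or the Spark API.

```scala
// A minimal sketch of the per-line parsing you would apply with
// inFile.map(parse) once the RDD is loaded. ParseLine and the sample
// line below are hypothetical, chosen only to mirror spam.data's
// whitespace-separated numeric format.
object ParseLine {
  // Turn one whitespace-separated line into an array of doubles.
  def parse(line: String): Array[Double] =
    line.trim.split("\\s+").map(_.toDouble)

  def main(args: Array[String]): Unit = {
    val sample = "0.0 0.64 0.64 0.0 0.32"  // made-up line in the same format
    val nums = parse(sample)
    println(nums.length)  // number of fields parsed from the line
    println(nums(1))      // second feature as a Double
  }
}
```

In the Spark shell, the same function would be applied across the whole dataset with something like val nums = inFile.map(ParseLine.parse), producing an RDD of numeric arrays instead of raw strings.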