Spark comes bundled with a REPL shell, a thin wrapper around the Scala shell. Though the Spark shell may look like a command line for simple tasks, many complex queries can be executed from it as well. This chapter explores the different development environments in which Spark applications can be developed.
Hadoop MapReduce's word count becomes very simple with the Spark shell. In this recipe, we will create a simple one-line text file, upload it to the Hadoop Distributed File System (HDFS), and use Spark to count the occurrences of words. Let's see how:
Create the words directory by using the following command:
$ mkdir words
Get into the words directory:
$ cd words
Create a sh.txt text file and enter "to be or not to be" in it:
$ echo "to be or not to be" > sh.txt
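The steps so far can be run together as one script; the recipe also calls for uploading the file to HDFS, which is typically done with hdfs dfs -put (shown here as a comment, since it assumes a running HDFS at localhost:9000 and is not part of the original steps):

```shell
# Create the local working directory and the one-line input file
mkdir -p words
cd words
echo "to be or not to be" > sh.txt
cat sh.txt   # prints: to be or not to be

# Hypothetical upload step, assuming HDFS is running on localhost:9000:
# hdfs dfs -put sh.txt hdfs://localhost:9000/user/hduser/words/sh.txt
```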
Start the Spark shell:
$ spark-shell
Load the words directory as an RDD:
scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")
Count the number of lines...
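A Spark word count typically continues by chaining flatMap, map, and reduceByKey on the words RDD. As a sketch of that same logic, assuming the one-line input above, here is the transformation traced with plain Scala collections (no Spark cluster needed), where groupBy plus a per-group sum stands in for reduceByKey:

```scala
// Contents of sh.txt, standing in for the lines of the RDD
val lines = Seq("to be or not to be")

val counts: Map[String, Int] = lines
  .flatMap(_.split(" "))       // split each line into words
  .map(w => (w, 1))            // pair each word with a count of 1
  .groupBy(_._1)               // group the pairs by word (reduceByKey analogue)
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // sum counts per word

// counts now maps each word to its frequency, e.g. "to" -> 2, "be" -> 2
```

The same chain on the RDD would distribute each of these steps across the cluster rather than computing them on a single local collection.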