Spark comes bundled with a read–eval–print loop (REPL) shell, which is a wrapper around the Scala shell. Though the Spark shell looks like a command line meant for simple tasks, complex queries can also be executed with it. The Spark shell is often used in the initial development phase; once the code is stabilized, it is written as a class file and bundled as a jar to be run using spark-submit. This chapter explores different development environments in which Spark applications can be developed.
Word count in Hadoop MapReduce takes at least three class files and one configuration file, namely the project object model (POM); with the Spark shell it becomes very simple. In this recipe, we are going to create a simple one-line text file, upload it to the Hadoop distributed file system (HDFS), and use Spark to count the occurrences of words. Let's see how:
- Create the words directory using the following command:
$ mkdir words...
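The counting step itself is a short chain of transformations: split lines into words, pair each word with 1, and sum the pairs per word. The following is a minimal sketch of that chain, not the recipe's own code. It uses plain Scala collections in place of Spark so that it runs without a cluster, and the object name, function name, and sample line are illustrative assumptions.

```scala
// Plain Scala sketch of the word-count pipeline; no Spark dependency needed.
// In the Spark shell, the equivalent RDD chain would look roughly like:
//   sc.textFile(path).flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
// where `path` is a placeholder for wherever the file was uploaded in HDFS.
object WordCountSketch {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+").toSeq)   // split each line into words
      .filter(_.nonEmpty)               // drop empty tokens (e.g. from leading whitespace)
      .map(word => (word, 1))           // pair each word with a count of 1
      .groupBy(_._1)                    // group the pairs by word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // sum counts per word

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be") // hypothetical sample line
    countWords(lines).foreach { case (word, n) => println(s"$word: $n") }
  }
}
```

Here groupBy followed by a per-group sum plays the role that reduceByKey plays on an RDD; in Spark, that step is also where data is shuffled between partitions.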