After adding Scalding as a project dependency, we can create our first Scalding job as src/main/scala/WordCountJob.scala:
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
The Scalding code above implements a Cascading flow that reads an input file as its source and writes the results to another file, which serves as the output tap. The pipeline lowercases each line, splits it into words on whitespace, and counts how many times each word appears in the input text.
Note
Find complete project files in the code accompanying this book at http://github.com/scalding-io/ProgrammingWithScalding.
We can create a dummy file to use as input with the following command:
$ echo "This is a happy day. A day to remember" > input.txt
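Before running the job, it can help to know roughly what to expect for this input. The following is a minimal sketch that mirrors the job's logic using only plain Scala collections; it does not use Scalding or Hadoop, and the names WordCountSketch, tokenize, and wordCount are hypothetical helpers introduced here for illustration:

```scala
// Plain Scala collections only; this is an illustration, not a Scalding job.
object WordCountSketch {

  // Mirrors the job's tokenizer: lowercase, then split on runs of whitespace.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\s+").toSeq

  // Mirrors groupBy('word) { _.size }: count the occurrences of each token.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(tokenize).groupBy(identity).map { case (w, ws) => (w, ws.size) }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("This is a happy day. A day to remember"))
    // Print tab-separated pairs, sorted by word for readability.
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

Because the tokenizer splits only on whitespace, punctuation stays attached to words: in this sketch "day." and "day" are counted as two separate words with a count of 1 each, while "a" has a count of 2. The actual job should produce the same counts, although the order of the rows in its output may differ.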
Scalding supports two types of execution modes: local mode and...