Before we jump to loading data, a quick overview of MapReduce is warranted. Although Druid comes with a convenient prepackaged MapReduce job for loading historical data, large distributed systems generally need custom jobs to perform analyses over the entire data set.
MapReduce is a framework that breaks processing into two phases: a map phase and a reduce phase. In the map phase, a function is applied to the entire set of input data, one element at a time. Each application of the map function emits a set of tuples, each containing a key and a value. Tuples sharing the same key are then grouped together and combined by the reduce function. The reduce function emits another set of tuples, typically by aggregating the values associated with each key.
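The two phases can be sketched in a few lines of plain Python. This is an illustrative single-process simulation, not a distributed implementation; the `map_phase` and `reduce_phase` helpers and the hourly-count example below are hypothetical names invented for this sketch.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, map_fn):
    """Apply the map function to every input element, one at a time,
    collecting the (key, value) tuples each application emits."""
    tuples = []
    for record in records:
        tuples.extend(map_fn(record))
    return tuples

def reduce_phase(tuples, reduce_fn):
    """Group tuples that share the same key, then apply the reduce
    function to each key and its list of values."""
    tuples.sort(key=itemgetter(0))  # in a real cluster, a shuffle/sort does this
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(tuples, key=itemgetter(0))]

# Hypothetical job: sum event counts per hour.
records = [("09:15", 3), ("09:40", 2), ("10:05", 7)]
mapped = map_phase(records, lambda r: [(r[0][:2], r[1])])   # key = hour
reduced = reduce_phase(mapped, lambda key, values: (key, sum(values)))
# reduced == [("09", 5), ("10", 7)]
```

In a real framework such as Hadoop, the grouping step (the "shuffle") happens across machines, but the contract for the map and reduce functions is the same as shown here.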
The canonical "Hello World" example for MapReduce is the word count. Given a set of documents that contain words, count the occurrences of each word. (Ironically, this is very similar to our NLP example.)
The following...