Storm Blueprints: Patterns for Distributed Real-time Computation


Hadoop


Before we jump to loading data, a quick overview of MapReduce is warranted. Although Druid comes prepackaged with a convenient MapReduce job to accommodate historical data, generally speaking, large distributed systems will need custom jobs to perform analyses over the entire data set.

An overview of MapReduce

MapReduce is a framework that breaks processing into two phases: a map phase and a reduce phase. In the map phase, a function is applied to the input data, one element at a time; each invocation of the map function emits a set of tuples, each containing a key and a value. Tuples with the same key are then grouped and passed to the reduce function, which emits another set of tuples, typically by combining the values associated with each key.

The canonical "Hello World" example for MapReduce is the word count. Given a set of documents that contain words, count the occurrences of each word. (Ironically, this is very similar to our NLP example.)
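As a rough illustration of the two phases (this is a minimal in-memory sketch in Python, not actual Hadoop code; the function names `map_phase` and `reduce_phase` are ours), word count could be expressed as:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: apply a function to each input element, emitting
    # a (word, 1) tuple for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(tuples):
    # Shuffle/reduce: group tuples by key, then combine the
    # values for each key (here, by summing the counts).
    groups = defaultdict(list)
    for key, value in tuples:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(map_phase(docs))
```

In a real Hadoop job the grouping step (the shuffle) happens across the cluster between the map and reduce phases, but the data flow is the same.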

The following...