Mastering Hadoop

By: Sandeep Karanth

Batch processing versus streaming

MapReduce is a batch-processing model: data is allowed to accumulate before it is processed. This leads to longer turnaround times and puts pressure on the storage, memory, and compute resources of the system. A batch of data must stay staged from the time it is collected until its analysis ends, occupying storage for that entire period. Analyzing a large batch also places a peak load on the nodes of the compute cluster for a short amount of time.

Batch models also lead to poor utilization of cluster resources. While data accumulates, the cluster's compute and memory sit idle; during analysis, they run at peak load. The cluster must nevertheless be provisioned for the peak load.
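The utilization argument can be made concrete with a little arithmetic. The following is a minimal sketch, not a measurement: the 24-hour workload, the load units, and the function name average_utilization are all hypothetical and chosen only for illustration.

```python
def average_utilization(load_per_interval, provisioned_capacity):
    """Return the fraction of provisioned capacity used, averaged over all intervals."""
    used = sum(min(load, provisioned_capacity) for load in load_per_interval)
    return used / (provisioned_capacity * len(load_per_interval))


# Hypothetical batch workload: data accumulates for 23 hours (no compute load),
# then the whole batch of 24 work units is analyzed in the final hour.
batch_load = [0] * 23 + [24]

# Provisioning must cater to the peak load, not the average.
provisioned_capacity = max(batch_load)

batch_utilization = average_utilization(batch_load, provisioned_capacity)
print(f"provisioned capacity: {provisioned_capacity} units/hour")
print(f"average utilization:  {batch_utilization:.1%}")
```

Under these assumed numbers the cluster is sized for 24 units per hour yet does useful work in only one hour out of 24, which is the idle-then-peak pattern described above.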

Streaming computation models overcome these disadvantages of batch-processing systems. Instead of moving the computation to the data, data is streamed through computation nodes. Each compute node operates on a single data point or a small window of data to analyze and output...
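The idea of a node operating on a small window of data can be sketched in a few lines. This is an illustrative, single-process example only, not how any particular streaming framework implements it: the function windowed_mean, the window size, and the sample readings are assumptions introduced here.

```python
from collections import deque
from typing import Iterable, Iterator


def windowed_mean(stream: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the mean of the most recent `window_size` values as each value arrives."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    window = deque(maxlen=window_size)  # holds only the current window, never the full data set
    running_sum = 0.0
    for value in stream:
        if len(window) == window_size:
            running_sum -= window[0]  # the oldest value is about to be evicted
        window.append(value)
        running_sum += value
        yield running_sum / len(window)


# Hypothetical sensor readings arriving one at a time.
readings = [10.0, 20.0, 30.0, 40.0]
results = list(windowed_mean(readings, window_size=2))
print(results)  # [10.0, 15.0, 25.0, 35.0]
```

Because the function keeps at most window_size values in memory and emits a result per input, nothing has to be staged in bulk before analysis starts, which is the contrast with the batch model drawn above.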