Real-Time Big Data Analytics

By: Sumit Gupta, Shilpi Saxena

Overview of this book

Enterprises have been striving hard to deal with the challenges of data arriving in real time or near real time. Although there are technologies such as Storm and Spark (and many more) that solve the challenges of real-time data, using the appropriate technology/framework for the right business use case is the key to success. This book provides you with the skills required to quickly design, implement, and deploy your real-time analytics using real-world examples of big data use cases. From the beginning of the book, we will cover the basics of varied real-time data processing frameworks and technologies. We will discuss and explain the differences between batch and real-time processing in detail, and will also explore the techniques and programming concepts using Apache Storm. Moving on, we'll familiarize you with Amazon Kinesis for real-time data processing in the cloud. We will further develop your understanding of real-time analytics through a comprehensive review of Apache Spark, along with the high-level architecture and the building blocks of a Spark program. You will learn how to transform your data, get an output from transformations, and persist your results using Spark RDDs, and how to work with structured data using Spark SQL. At the end of the book, we will introduce Spark Streaming, the streaming library of Spark, and walk you through the emerging Lambda Architecture (LA), which provides a hybrid platform for big data processing by combining real-time and precomputed batch data to give a near real-time view of incoming data.

Distributed batch processing


The first and foremost point to understand is the different kinds of processing that can be applied to data. Broadly, they fall into two categories:

  • Batch processing

  • Sequential or inline processing

The key difference between the two is that sequential processing works on a per-tuple basis: events are processed as they are generated or ingested into the system. In the case of batch processing, events are not processed as they are generated or ingested; they are accumulated and processed in fixed-size batches. For example, 100 credit card transactions are clubbed into a batch and then consolidated.
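To make the contrast concrete, here is a minimal, illustrative sketch (not taken from the book) that handles credit card transaction amounts either one tuple at a time or consolidated in batches of 100; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: per-tuple versus batch handling of transaction amounts.
public class ProcessingModes {

    // Sequential/inline processing: each event is handled as soon as it arrives.
    static void onTransaction(double amount) {
        System.out.println("Processed single transaction: " + amount);
    }

    // Batch processing: events are buffered and consolidated once the batch is full.
    static final int BATCH_SIZE = 100;
    static final List<Double> buffer = new ArrayList<>();

    static void onTransactionBatched(double amount) {
        buffer.add(amount);
        if (buffer.size() == BATCH_SIZE) {
            double total = buffer.stream().mapToDouble(Double::doubleValue).sum();
            System.out.println("Consolidated " + BATCH_SIZE + " transactions, total = " + total);
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 250; i++) {
            double amount = 10.0 + i;       // stand-in for an incoming transaction
            onTransaction(amount);          // sequential path: acts immediately
            onTransactionBatched(amount);   // batched path: acts every 100 events
        }
    }
}
```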

Some of the key aspects of batch processing systems are as follows:

  • Size of a batch or the boundary of a batch

  • Batching (starting a batch and terminating a batch)

  • Sequencing of batches (if required by the use case)

A batch can be identified by size (which could be x number of records, for example, a 100-record batch). Batches can also be defined by time ranges, such as hourly batches, daily batches, and so on. They can even be dynamic and data-driven, where a particular sequence/pattern in the input data demarcates the start of a batch and another marks its end.

Once a batch boundary is demarcated, the bundle of records is marked as a batch, which can be done by adding a header/trailer or a consolidated data structure, bundled with a batch identifier. The batching logic also performs bookkeeping and accounting for each batch being created and dispatched for processing.

In certain use cases, the order of records or their sequence needs to be maintained, which leads to the need to sequence the batches. In these specialized scenarios, the batching logic has to do extra processing to sequence the batches, and extra caution needs to be applied to the bookkeeping.
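As an illustration (a minimal sketch under assumed requirements, not the book's implementation), the following hypothetical Batcher closes a batch at a fixed size boundary, stamps it with a monotonically increasing sequence number that doubles as the batch identifier, and keeps a running count as its bookkeeping; all names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of size-based batching with batch identifiers and sequencing.
public class Batcher {

    // A batch: a sequence number (its identifier/header) plus the bundled records.
    static class Batch {
        final long sequenceNumber;
        final List<String> records;
        Batch(long sequenceNumber, List<String> records) {
            this.sequenceNumber = sequenceNumber;
            this.records = records;
        }
    }

    private final int batchSize;
    private long nextSequence = 0;              // bookkeeping: batches created so far
    private List<String> current = new ArrayList<>();

    Batcher(int batchSize) { this.batchSize = batchSize; }

    // Add a record; when the size boundary is reached, close and return the batch.
    Batch add(String record) {
        current.add(record);
        if (current.size() < batchSize) {
            return null;                        // batch boundary not reached yet
        }
        Batch batch = new Batch(nextSequence++, current);
        current = new ArrayList<>();            // start accumulating the next batch
        return batch;
    }
}
```

A caller would feed records into add() and dispatch each non-null batch for processing; downstream consumers can use the sequence number to restore order in the use cases that require it.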

Now that we understand what batch processing is, the next obvious step is to understand what distributed batch processing is. It is a computing paradigm where tuples/records are batched and then distributed for processing across a cluster of nodes/processing units. Once each node completes the processing of its allocated batch, the results are collated and summarized to produce the final result. In today's applications, where we routinely process huge amounts of data and expect results at lightning-fast speed, a single-node machine cannot meet these needs; we need a computational cluster. In computing, we can add computation or storage capability by two means: vertical scaling and horizontal scaling.

Vertical scaling is a paradigm where we add more compute capability to an existing node; for example, adding more CPUs or RAM, or replacing the node with a more powerful machine. This model works well only up to a point: you soon hit the ceiling, and your needs outgrow what the biggest available machine can deliver. Vertical scaling therefore has an inherent scaling limit, and it also introduces a single point of failure, because the entire application runs on one machine.

So vertical scaling is limited and failure prone, and higher-end machines are expensive too. The solution is horizontal scaling, which relies on clustering: the computational capability is derived not from a single node but from a collection of nodes. In this paradigm, the system is scalable and there is no single point of failure.
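To make the distributed batch idea concrete, here is a minimal sketch (an illustration only, not the book's code) that fans batches out to a pool of workers and collates the partial results; a local thread pool stands in for a cluster of nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: distribute batches to workers and collate the results.
public class DistributedBatchSketch {
    public static void main(String[] args) throws Exception {
        List<List<Integer>> batches = List.of(
            List.of(1, 2, 3), List.of(4, 5, 6), List.of(7, 8, 9));

        ExecutorService cluster = Executors.newFixedThreadPool(3);   // stand-in "nodes"
        List<Future<Integer>> partials = new ArrayList<>();

        // Each "node" processes its allocated batch independently.
        for (List<Integer> batch : batches) {
            partials.add(cluster.submit(
                () -> batch.stream().mapToInt(Integer::intValue).sum()));
        }

        // Collate the partial results into the final result.
        int total = 0;
        for (Future<Integer> partial : partials) {
            total += partial.get();
        }
        System.out.println("Final collated result: " + total);   // prints 45
        cluster.shutdown();
    }
}
```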

Batch processing in distributed mode

For a very long time, Hadoop was synonymous with Big Data, but Big Data has now branched off into various specialized, non-Hadoop compute segments as well. At its core, Hadoop is a distributed, batch-processing compute framework that operates on MapReduce principles.

It has the ability to process a huge amount of data by virtue of batching and parallel processing. The key aspect is that it moves the computation to the data, instead of the traditional approach of moving data to the computation. A model that operates on a cluster of nodes is horizontally scalable and fault tolerant.

Hadoop is a solution for offline, batch data processing. Traditionally, the NameNode was a single point of failure, but newer versions and YARN (short for Yet Another Resource Negotiator) have changed that limitation. From a computational perspective, YARN has brought about a major shift: it has decoupled MapReduce from Hadoop and provided the scope for integration with other real-time, parallel processing compute engines such as Spark, MPI (short for Message Passing Interface), and so on.

Push code to data

So far, the general computational models have a data flow where the data is ingested and moved to the compute engine.

The advent of distributed batch processing changed this, as depicted in the following figure: batches of data are moved to the various nodes in the compute-engine cluster. This shift was a major advance in the processing arena, bringing the power of parallel processing to applications.

Moving data to compute makes sense for low-volume data. But for a Big Data use case involving humongous amounts of data, moving data to the compute engine may not be sensible, because network latency can have a huge impact on the overall processing time. Hadoop therefore shifted the paradigm: it creates batches of input data, called blocks, and distributes them to each node in the cluster. Take a look at this figure:

At the initialization stage, the Big Data file is pushed into HDFS. The file is then split into chunks (or file blocks), which the Hadoop NameNode (master node) places onto individual DataNodes (slave nodes) in the cluster for concurrent processing.
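The following sketch shows one way to inspect that placement, assuming the Hadoop client libraries are on the classpath, a cluster is reachable through the usual configuration files, and the HDFS path shown is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: list which DataNodes hold the blocks of a file already in HDFS.
public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/input/bigfile.txt");   // hypothetical HDFS path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per HDFS block, with the hosts (DataNodes) that store it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```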

A process in the cluster called the JobTracker moves the execution code, or processing, to the data. The compute component comprises a Mapper and a Reducer class. In very simple terms, a Mapper class does the job of data filtering, transformation, and splitting. By the nature of localized compute, a Mapper instance processes only the data blocks that are local to, or co-located on, the same DataNode; this concept is called data locality or proximity. Once the Mappers have executed, their outputs are shuffled to the appropriate Reducer nodes. A Reducer, by its functionality, is an aggregator that compiles all the results from the Mappers.
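To see this division of labour in code, here is the classic word-count example written against the Hadoop MapReduce API (the driver/Job configuration is omitted for brevity); it is a standard illustration rather than an example specific to this book. The Mapper splits each input line into (word, 1) pairs, and the Reducer aggregates the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Classic word count: the Mapper filters/transforms/splits lines into (word, 1)
// pairs, and the Reducer aggregates the counts emitted by all the Mappers.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);      // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();              // aggregate counts from all Mappers
            }
            result.set(sum);
            context.write(key, result);        // emit (word, total count)
        }
    }
}
```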