Real-Time Big Data Analytics

By: Sumit Gupta, Shilpi Saxena

Overview of this book

Enterprises have been striving to deal with the challenges of data arriving in real time or near real time. Although technologies such as Storm and Spark (and many more) address these challenges, choosing the appropriate technology or framework for the right business use case is the key to success. This book provides you with the skills required to quickly design, implement, and deploy your real-time analytics using real-world examples of big data use cases. The book begins with the basics of varied real-time data processing frameworks and technologies. We will discuss and explain the differences between batch and real-time processing in detail, and will also explore the techniques and programming concepts using Apache Storm. Moving on, we’ll familiarize you with Amazon Kinesis for real-time data processing in the cloud. We will further develop your understanding of real-time analytics through a comprehensive review of Apache Spark along with the high-level architecture and the building blocks of a Spark program. You will learn how to transform your data, get an output from transformations, and persist your results using Spark RDDs, and how to work with Spark through a SQL-style interface called Spark SQL. At the end of this book, we will introduce Spark Streaming, the streaming library of Spark, and will walk you through the emerging Lambda Architecture (LA), which provides a hybrid platform for big data processing by combining real-time and precomputed batch data to provide a near real-time view of incoming data.
Table of Contents (17 chapters)
Real-Time Big Data Analytics
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Preface
Index

Preface

Processing historical data for the past 10-20 years, performing analytics, and finally producing business insights is the most popular use case for today's modern enterprises.

Enterprises have been focusing on developing data warehouses (https://en.wikipedia.org/wiki/Data_warehouse) where they want to store the data fetched from every possible data source and leverage various BI tools to provide analytics over the data stored in these data warehouses. But developing data warehouses is a complex, time-consuming, and costly process that requires a considerable investment of both money and time.

There is no doubt that the emergence of Hadoop and its ecosystem has provided a new paradigm or architecture for solving large data problems: a low-cost, scalable solution that processes terabytes of data in a few hours, where the same work could earlier have taken days. But this is only one side of the coin. Hadoop was meant for batch processes, while there are a bunch of other business use cases that need to perform analytics and produce business insights in real or near real-time (sub-second SLAs). This is called real-time analytics (RTA) or near real-time analytics (NRTA), and is sometimes also termed "fast data", implying the ability to make near real-time decisions and enable "orders-of-magnitude" improvements in elapsed time to decisions for businesses.

A number of powerful, easy-to-use open source platforms have emerged to solve these enterprise real-time analytics use cases. Two of the most notable ones are Apache Storm and Apache Spark, which offer real-time data processing and analytics capabilities to a much wider range of potential users. Both projects are part of the Apache Software Foundation, and while the two tools provide overlapping capabilities, they still have distinctive features and different roles to play.

Interesting, isn't it?

Let's move forward and jump into the nitty gritty of real-time Big Data analytics with Apache Storm and Apache Spark. This book provides you with the skills required to quickly design, implement, and deploy your real-time analytics using real-world examples of Big Data use cases.

What this book covers

Chapter 1, Introducing the Big Data Technology Landscape and Analytics Platform, sets the context by providing an overview of the Big Data technology landscape, the various kinds of data processing that are handled on Big Data platforms, and the various types of platforms available for performing analytics. It introduces the paradigm of distributed processing of large data in batch and real-time or near real-time. It also talks about the distributed databases to handle high velocity/frequency reads or writes.

Chapter 2, Getting Acquainted with Storm, introduces the concepts, architecture, and programming with Apache Storm as a real-time or near real-time data processing framework. It talks about the various concepts of Storm, such as spouts, bolts, Storm parallelism, and so on. It also explains the usage of Storm in the world of real-time Big Data analytics with sufficient use cases and examples.

Chapter 3, Processing Data with Storm, focuses on the internals and operations that Apache Storm exposes, such as filters, joins, and aggregators, to process streaming data in real or near real-time. It showcases the integration of Storm with various input data sources, such as Apache Kafka, sockets, filesystems, and so on, and finally leverages the Storm JDBC framework for persisting the processed data. It also talks about the various enterprise concerns in stream processing, such as reliability, acknowledgement of messages, and so on, in Storm.

Chapter 4, Introduction to Trident and Optimizing Storm Performance, examines the processing of transactional data in real or near real-time. It introduces Trident as a real-time processing framework that is used primarily for processing transactional data, and talks about the various constructs for handling transactional use cases using Trident. This chapter also covers the concepts and parameters available for monitoring, optimizing, and performance tuning the Storm framework and its jobs. It touches on the internals of Storm such as LMAX, ring buffer, ZeroMQ, and more.

Chapter 5, Getting Acquainted with Kinesis, talks about the real-time data processing technology available on the cloud—the Kinesis service for real-time data processing from Amazon Web Services (AWS). It starts with the explanation of the architecture and components of Kinesis and then illustrates an end-to-end example of real-time alert generation using various client libraries, such as KCL, KPL, and so on.

Chapter 6, Getting Acquainted with Spark, introduces the fundamentals of Apache Spark along with the high-level architecture and the building blocks for a Spark program. It starts with an overview of Spark and talks about the applications and usage of Spark in varied batch and real-time use cases. The chapter then covers the high-level architecture and the various components of Spark, and closes with the installation and configuration of a Spark cluster and the execution of the first Spark job.

Chapter 7, Programming with RDDs, provides a code-level walkthrough of Spark RDDs. It talks about various kinds of operations exposed by RDD APIs along with their usage and applicability to perform data transformation and persistence. It also showcases the integration of Spark with NoSQL databases, such as Apache Cassandra.

Chapter 8, SQL Query Engine for Spark – Spark SQL, introduces a SQL style programming interface called Spark SQL for working with Spark. It familiarizes the reader with how to work with varied datasets, such as Parquet or Hive and build queries using DataFrames or raw SQL; it also makes recommendations on best practices.

Chapter 9, Analysis of Streaming Data Using Spark Streaming, introduces another extension of Spark, Spark Streaming, for capturing and processing streaming data in real or near real-time. It starts with the architecture of Spark and briefly talks about the varied APIs and operations exposed by Spark Streaming for data loading, transformations, and persistence. The chapter then covers the integration of Spark SQL and Spark Streaming for querying data in real time, and closes with the deployment and monitoring aspects of Spark Streaming jobs.

Chapter 10, Introducing Lambda Architecture, walks the reader through the emerging Lambda Architecture, which provides a hybrid platform for Big Data processing by combining real-time and pre-computed batch data to provide a near real-time view of the data. It leverages Apache Spark and discusses the realization of Lambda Architecture with a real-life use case.
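
To make the idea behind Lambda Architecture concrete, here is a minimal sketch in plain Java of what "combining real-time and pre-computed batch data" can look like at query time. It is only an illustration under stated assumptions and is not the implementation discussed in Chapter 10: the class name LambdaViewSketch, the method mergeViews, and the page-count data are all hypothetical, and the sketch uses in-memory maps rather than Spark.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: these names do not come from the book or from Spark.
class LambdaViewSketch {

    // batchView:    totals pre-computed by a periodic batch job (assumed input).
    // realtimeView: increments for events that arrived after the last batch run.
    // Returns a new map holding the near real-time view; the inputs are not modified.
    static Map<String, Long> mergeViews(Map<String, Long> batchView,
                                        Map<String, Long> realtimeView) {
        Map<String, Long> merged = new HashMap<>(batchView);
        // Add each real-time increment on top of the batch total for the same key.
        realtimeView.forEach((key, delta) -> merged.merge(key, delta, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("page/home", 1000L);
        batchView.put("page/cart", 250L);

        Map<String, Long> realtimeView = new HashMap<>();
        realtimeView.put("page/home", 12L);
        realtimeView.put("page/checkout", 3L);

        Map<String, Long> merged = mergeViews(batchView, realtimeView);
        System.out.println(merged.get("page/home"));     // 1012
        System.out.println(merged.get("page/cart"));     // 250
        System.out.println(merged.get("page/checkout")); // 3
    }
}
```

The sketch assumes that the two views never count the same event twice: once a batch run has absorbed a set of events, the matching real-time increments are expected to be discarded.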

What you need for this book

Readers should have programming experience in Java or Scala and some basic knowledge or understanding of any distributed computing platform such as Apache Hadoop.

Who this book is for

If you are a Big Data architect, developer, or a programmer who wants to develop applications or frameworks to implement real-time analytics using open source technologies, then this book is for you. This book is aimed at competent developers who have basic knowledge and understanding of Java or Scala to allow efficient programming of core elements and applications.

If you are reading this book, then you probably are familiar with the nuisances and challenges of large data or Big Data. This book will cover the various tools and technologies available for processing and analyzing streaming data or data arriving at high frequency in real or near real-time. It will cover the paradigm of in-memory distributed computing offered by various tools and technologies such as Apache Storm, Spark, Kinesis, and so on.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The PATH variable should have the path to Python installation on your machine."

A block of code is set as follows:

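public class Count implements CombinerAggregator<Long> {
   @Override
   public Long init(TridentTuple tuple) {
      return 1L;   // each tuple contributes a count of one
   }
   @Override
   public Long combine(Long val1, Long val2) { return val1 + val2; }
   @Override
   public Long zero() { return 0L; }
}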

Any command-line input or output is written as follows:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "The landing page on Storm UI first talks about Cluster Summary."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail us and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.

  2. Hover the mouse pointer on the SUPPORT tab at the top.

  3. Click on Code Downloads & Errata.

  4. Enter the name of the book in the Search box.

  5. Select the book for which you're looking to download the code files.

  6. Choose from the drop-down menu where you purchased this book.

  7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows

  • Zipeg / iZip / UnRarX for Mac

  • 7-Zip / PeaZip for Linux

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us, and we will do our best to address the problem.