Real-Time Big Data Analytics

Real-Time Big Data Analytics

By : Sumit Gupta, Shilpi Saxena

Buy this Book

Real-Time Big Data Analytics

By: Sumit Gupta, Shilpi Saxena

Buy this Book

Overview of this book

Enterprise has been striving hard to deal with the challenges of data arriving in real time or near real time. Although there are technologies such as Storm and Spark (and many more) that solve the challenges of real-time data, using the appropriate technology/framework for the right business use case is the key to success. This book provides you with the skills required to quickly design, implement and deploy your real-time analytics using real-world examples of big data use cases. From the beginning of the book, we will cover the basics of varied real-time data processing frameworks and technologies. We will discuss and explain the differences between batch and real-time processing in detail, and will also explore the techniques and programming concepts using Apache Storm. Moving on, we’ll familiarize you with “Amazon Kinesis” for real-time data processing on cloud. We will further develop your understanding of real-time analytics through a comprehensive review of Apache Spark along with the high-level architecture and the building blocks of a Spark program. You will learn how to transform your data, get an output from transformations, and persist your results using Spark RDDs, using an interface called Spark SQL to work with Spark. At the end of this book, we will introduce Spark Streaming, the streaming library of Spark, and will walk you through the emerging Lambda Architecture (LA), which provides a hybrid platform for big data processing by combining real-time and precomputed batch data to provide a near real-time view of incoming data.

Real-Time Big Data Analytics

Credits

About the Authors

About the Reviewer

www.PacktPub.com

Preface

Free Chapter

Introducing the Big Data Technology Landscape and Analytics Platform

Big Data – a phenomenon

The Big Data dimensional paradigm

The Big Data ecosystem

The Big Data infrastructure

Components of the Big Data ecosystem

Distributed batch processing

Distributed databases (NoSQL)

Real-time processing

Summary

Getting Acquainted with Storm

An overview of Storm

Storm architecture and its components

How and when to use Storm

Storm internals

Summary

Processing Data with Storm

Storm input sources

Other sources for input to Storm

Reliability of data processing

Storm simple patterns

Storm persistence

Summary

Introduction to Trident and Optimizing Storm Performance

Working with Trident

Understanding LMAX

Storm internode communication

Understanding the Storm UI

Optimizing Storm performance

Summary

Getting Acquainted with Kinesis

Architectural overview of Kinesis

Creating a Kinesis streaming service

Summary

Getting Acquainted with Spark

An overview of Spark

The architecture of Spark

Resilient distributed datasets (RDD)

Writing and executing our first Spark program

Summary

Programming with RDDs

Understanding Spark transformations and actions

Programming Spark transformations and actions

Handling persistence in Spark

Summary

SQL Query Engine for Spark – Spark SQL

The architecture of Spark SQL

Coding our first Spark SQL job

Converting RDDs to DataFrames

Working with Parquet

Working with Hive tables

Performance tuning and best practices

Summary

Analysis of Streaming Data Using Spark Streaming

High-level architecture

Coding our first Spark Streaming job

Querying streaming data in real time

Deployment and monitoring

Summary

Introducing Lambda Architecture

What is Lambda Architecture

The technology matrix for Lambda Architecture

Realization of Lambda Architecture

Summary

Index

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Distributed databases (NoSQL)

We have discussed the paradigm shift from data to computation to the paradigm of computation to data in case of Hadoop. We understand on the conceptual level how to harness the power of distributed computation. The next step is to apply the same to database level in terms of having distributed databases.

In very simple terms, a database is actually a storage structure that lets us store the data in a very structured format. It can be in the form of various data structural representations internally, such as flat files, tables, blobs, and so on. Now when we talk about a database, we generally refer to single/clustered server class nodes with huge storage and specialized hardware to support the operations. So, this can be envisioned as a single unit of storage controlled by a centralized control unit.

Distributed database, on the contrary, is a database where there is no single control unit or storage unit. It's basically a cluster of homogenous/heterogeneous nodes, and the data and the control for execution and orchestration is distributed across all nodes in the cluster. So to understand it better, we can use an analogy that, instead of all the data going into a single huge box, now the data is spread across multiple boxes. The execution of this distribution, the bookkeeping and auditing of this data distribution, and the retrieval process are managed by multiple control units. In a way, there is no single point of control or storage. One important point is that these multiple distributed nodes can exist physically or virtually.

Note

Do not relate this to the concept of parallel systems, where the processors are tightly coupled and it all constitutes a single database system. A distributed database system is a relatively loosely coupled entity that shares no physical components.

Now we understand the differentiating factors for distributed databases. However, it is necessary to understand that, due to their distributed nature, these systems have some added complexity to ensure correctness and the accuracy of the day. There are two processes that play a vital role:

Replication: This is tracked by a special component of distributed databases. This piece of software does the tedious job of bookkeeping for all the updates/additions/deletions that are being made to the data. Once all changes are logged, this process updates so that all copies of the data look the same and represent the state of truth.
Duplication: The popularity of distributed databases is due to the fact that they don't have a single point of failure and there is always more than one copy of data available to deal with a failure situation if it happens. The process that copies one instance of data to multiple locations on the cluster generally executes at some defined interval.

Both these processes ensure that at a given point in time, more than one copy of data exists in the cluster, and all the copies of the data have the same representation of the state of truth for the data.

A NoSQL database environment is a non-relational and predominately distributed database system. Its clear advantage is that it facilitates the rapid analysis of extremely high-volume, disparate data types. With the advent of Big Data, NoSQL databases have become the cheap and scalable alternative to traditional RDBMS. The USPs they have to offer are availability and fault tolerance, which are big differentiating factors.

NoSQL offers a flexible and extensible schema model with added advantages of endless scalability, distributed setup, and the liberty of interfacing with non-SQL interfaces.

We can distinguish NoSQL databases as follows:

Key-value store: This type of database belongs to some of the least complex NoSQL options. Its USP is the design that allows the storage of data in a schemaless way. All the data in this store contains an index key and an associate value (as the name suggests). Popular examples of this type of database are Cassandra, DynamoDB, Azure Table Storage (ATS), Riak, Berkeley DB, and so on.
Column store or wide column store: This is designed for storing the data in rows and its data in data tables, where there are columns of data rather than rows of data, like in a standard database. They are the exact opposite of row-based databases, and this design is highly scalable and offers very high performance. Examples are HBase and Hypertable.
Document database: This is an extension to the basic idea of a key-value store where documents are more complex and elaborate. It's like each document has a unique ID associated with it. This ID is used for document retrieval. These are very widely used for storing and managing document-oriented information. Examples are MongoDB and CouchDB.
Graph database: As the name suggests, it's based on the graph theory of discreet mathematics. It's well designed for data where relationships can be maintained as a graph and elements are interconnected based on relations. Examples are Neo4j, polyglot, and so on.

The following table lays out the key attributes and certain dimensions that can be used for selecting the appropriate NoSQL databases:

Column 1: This captures the storage structure for the data model.
Column 2: This captures the performance of the distributed database on the scale of low, medium, and high.
Column 3: This captures the ease of scalability of the distributed database on the scale of low, medium, and high. It notes how easily the system can be scaled in terms of capacity and processing by adding more nodes to the cluster.
Column 4: Here we talk about the scale of flexibility of use and the ability to cater to diverse structured or unstructured data and use cases.

Column 5: Here we talk about how complex it is to work with the system in terms of the complexity of development and modeling, the complexity of operation and maintainability, and so on.

Data model	Performance	Scalability	Flexibility	Complexity
Key-value store	High	High	High	None
Column Store	High	High	Moderate	Low
Document Store	High	Variable (high)	High	Low
Graph Database	Variable	Variable	High	High

Advantages of NoSQL databases

Let's have a look at the key reasons for adopting an NoSQL database over a traditional RDBMS. Here are the key drivers that have been attributed to the shift:

Advent and growth of Big Data: This is one of the prime attribute forces driving the growth and shift towards use of NoSQL.
High availability systems: In today's highly competitive world, downtime can be deadly. The reality of business is that hardware failures will occur, but NoSQL database systems are built over a distributed architecture so there is no single point of failure. They also have replication to ensure redundancy of data that guarantees availability in the event of one or more nodes being down. With this mechanism, the availability across data centers is guaranteed, even in the case of localized failures. This all comes with guaranteed horizontal scaling and high performance.
Location independence: This refers to the ability to perform read and write operations to the data store regardless of the physical location where that input-output operation actually occurs. Similarly, we have the ability to have any write percolated out from that location. This feature is a difficult wish to make in the RDBMS world. This is a very handy tool when it comes to designing an application that services customers in many different geographies and needs to keep data local for fast access.
Schemaless data models: One of the major motivator for move to a NoSQL database system from an old-world relational database management system (RDBMS) is the ability to handle unstructured data and it's found in most NoSQL stores. The relational data model is based on strict relations defined between tables, which themselves are very strict in definition by a determined column structure. All of this then gets organized in a schema. The backbone of RDBMS is structure and it's the biggest limitation as this makes it fall short for handling and storing unstructured data that doesn't fit into strict table structure. A NoSQL data model on the contrary doesn't have any structure and it's flexible to fit in any form, so it's called schemaless. It's like one size fits all, and it's able to accept structured, semistructured, or unstructured data. All this flexibility comes along with a promise of low-cost scalability and high performance.

Choosing a NoSQL database

When it comes to making a choice, there are certain factors that can be taken into account, but the decision is still more use-case-driven and can vary from case to case. Migration of a data store or choosing a data store is an important, conscious decision, and should be made diligently and intelligently depending on the following factors:

Input data diversity
Scalability
Performance
Availability
Cost
Stability
Community

Real-Time Big Data Analytics

By : Sumit Gupta, Shilpi Saxena

Real-Time Big Data Analytics

By: Sumit Gupta, Shilpi Saxena

Overview of this book

Related Content you might be interested in

Current Title:

Real-Time Big Data Analytics

Distributed databases (NoSQL)

Note

Advantages of NoSQL databases

Choosing a NoSQL database