Index
A
- Abstract Syntax Tree / The Catalyst optimizer
- advanced data sources
- reference link / The components of Spark Streaming
- Amazon Kinesis
- about / Benefits and use cases of Amazon Kinesis
- managed service / Benefits and use cases of Amazon Kinesis
- disruptive innovation / Benefits and use cases of Amazon Kinesis
- benefits / Benefits and use cases of Amazon Kinesis
- telecommunication / Benefits and use cases of Amazon Kinesis
- healthcare / Benefits and use cases of Amazon Kinesis
- automotive / Benefits and use cases of Amazon Kinesis
- Amazon S3
- reference link / Executing Spark Streaming applications on Apache Mesos
- Analytical Engine / Solution implementation
- anchoring / The concept of anchoring and reliability
- annotations, org.apache.spark.annotation
- DeveloperAPI / Spark packaging structure and core APIs
- Experimental / Spark packaging structure and core APIs
- AlphaComponent / Spark packaging structure and core APIs
- Apache Cassandra 2.1.7
- reference link / Configuring Apache Cassandra and Spark
- Apache Flume
- Apache Hadoop
- Apache Kafka
- Apache Mesos
- about / The Spark execution model – master-worker view, Executing Spark Streaming applications on Apache Mesos
- URL / The Spark execution model – master-worker view
- reference link / Executing Spark Streaming applications on Apache Mesos
- Spark Streaming applications, executing on / Executing Spark Streaming applications on Apache Mesos
- Apache Sqoop
- ApplicationMaster (AM) / Executing Spark Streaming applications on Yarn
- application master (AM) / The Spark execution model – master-worker view
- architectural overview, Kinesis
- about / Architectural overview of Kinesis
- Amazon Kinesis, benefits / Benefits and use cases of Amazon Kinesis
- high-level architecture / High-level architecture
- components / Components of Kinesis
- auto learning synchronization mechanism / Solution implementation
- Avro
- reference / Schema evolution/merging
- AWS SDK / Components of Kinesis
- Azure Table Storage (ATS) / Distributed databases (NoSQL)
B
- batch data processing
- about / Batch data processing
- use cases / Batch data processing
- challenges / Batch data processing
- batch duration
- about / High-level architecture
- batching
- batch mode / The emergence of Spark SQL
- batch processing
- in distributed mode
- about / Batch processing in distributed mode
- code, pushing to data / Push code to data
- Big Data
- about / Big Data – a phenomenon
- dimensional paradigm / The Big Data dimensional paradigm
- infrastructure / The Big Data infrastructure
- Big Data analytics architecture
- about / The Big Data analytics architecture
- business solution, building / Building business solutions
- data processing / Dataset processing
- solution implementation / Solution implementation
- presentation / Presentation
- Big Data ecosystem
- about / The Big Data ecosystem
- components / Components of the Big Data ecosystem
- Big Data problem statements, Lambda Architecture
- Volume / Layers/components of Lambda Architecture
- reference link / Layers/components of Lambda Architecture
- Velocity / Layers/components of Lambda Architecture
- Variety / Layers/components of Lambda Architecture
- bolts
- Business Intelligence (BI) / The Big Data ecosystem
C
- Call Data Record (CDR) / The telecoms or cellular arena
- CAS (compare-and-swap) / Producers
- cascading / Components of the Big Data ecosystem
- Cassandra Core driver
- reference link / Configuring Apache Cassandra and Spark
- Cassandra Query Language (CQL) / Configuring Apache Cassandra and Spark
- Catalyst optimizer
- about / The Catalyst optimizer
- phases / The Catalyst optimizer
- challenges, batch data processing
- large data / Batch data processing
- distributed processing / Batch data processing
- SLAs / Batch data processing
- fault tolerant / Batch data processing
- challenges, in selecting technology for data consumption layer
- highly available / The technology matrix for Lambda Architecture
- fault tolerance / The technology matrix for Lambda Architecture
- reliability / The technology matrix for Lambda Architecture
- performance efficient / The technology matrix for Lambda Architecture
- extendable and flexible / The technology matrix for Lambda Architecture
- challenges, real-time data processing
- strict SLAs / Real-time data processing
- recovering from failures / Real-time data processing
- scalable / Real-time data processing
- all in-memory / Real-time data processing
- asynchronous / Real-time data processing
- cluster manager
- cluster managers, for Spark streaming
- Coda Hale metrics library
- reference link / Monitoring Spark Streaming applications
- Complex Event Processing (CEP) / Real-time processing
- components, Big Data ecosystem
- components, Kinesis
- about / Components of Kinesis
- data sources / Components of Kinesis
- producers / Components of Kinesis
- consumers / Components of Kinesis
- AWS SDK / Components of Kinesis
- KPL / Components of Kinesis
- KCL / Components of Kinesis
- Kinesis streams / Components of Kinesis
- shards / Components of Kinesis
- partition keys / Components of Kinesis
- sequence numbers / Components of Kinesis
- components, Spark SQL
- DataFrame API / The DataFrame API
- Catalyst optimizer / The Catalyst optimizer
- SQL/Hive contexts / SQL and Hive contexts
- components, Spark Streaming
- about / The components of Spark Streaming
- input data streams / The components of Spark Streaming
- Spark streaming job / The components of Spark Streaming
- Spark core engine / The components of Spark Streaming
- output data streams / The components of Spark Streaming
- components/layers, Lambda Architecture
- data sources / Layers/components of Lambda Architecture
- data consumption layer / Layers/components of Lambda Architecture
- batch layer / Layers/components of Lambda Architecture
- real-time layers / Layers/components of Lambda Architecture
- serving layers / Layers/components of Lambda Architecture
- ConnectionProvider interface
- consumer group
- about / Getting to know more about Kafka
- cost-based optimization / The Catalyst optimizer
- CQLSH
- custom connectors
- reference link / The components of Spark Streaming
D
- Dashboard/Workbench / Solution implementation
- Data as a Service (DaaS) / The Big Data ecosystem
- DataFrame API
- about / The DataFrame API
- DataFrames and RDD / DataFrames and RDD
- user-defined functions / User-defined functions
- DataFrames and SQL / DataFrames and SQL
- DataFrames
- about / Spark extensions/libraries
- Data Lineage
- data mining
- about / When to use Spark – practical use cases
- reference link / When to use Spark – practical use cases
- data processing
- reliability / Reliability of data processing
- anchoring / The concept of anchoring and reliability
- Storm acking framework / The Storm acking framework
- dependencies
- deployment
- about / Deployment and monitoring
- dimensional paradigm, Big Data
- about / The Big Data dimensional paradigm
- volume / The Big Data dimensional paradigm
- velocity / The Big Data dimensional paradigm
- variety / The Big Data dimensional paradigm
- veracity / The Big Data dimensional paradigm
- value / The Big Data dimensional paradigm
- Directed Acyclic Graph (DAG) / Partitioning and parallelism
- directed acyclic graph (DAG)
- about / Spark packaging structure and core APIs
- reference link / Spark packaging structure and core APIs
- distributed batch processing
- about / Distributed batch processing
- distributed computing
- reference link / The technology matrix for Lambda Architecture
- distributed databases (NoSQL)
- about / Distributed databases (NoSQL)
- DoubleRDDFunctions.scala
- DStreams
- duplication / Distributed databases (NoSQL)
- DynamoDB
- reference / Components of Kinesis
E
- Eclipse
- installing / Eclipse
- Eclipse Luna (4.4)
- download link / Eclipse
- electronic publishing
- reference link / Real-time data processing
- electronic trading platform
- reference link / Real-time data processing
- ETL (Extract Transform Load) / Dataset processing
- extensibility
- reference link / The need for Lambda Architecture
- extensions/libraries, Spark
- Spark Streaming / Spark extensions/libraries
- MLlib / Spark extensions/libraries
- GraphX / Spark extensions/libraries
- Spark SQL / Spark extensions/libraries
- SparkR / Spark extensions/libraries
F
- fastutil library
- URL / Memory tuning
- fault tolerance
- reference link / The need for Lambda Architecture
- fault tolerant
- reference link / Batch data processing
- features, Lambda Architecture
- scalable / The need for Lambda Architecture
- resilient to failures / The need for Lambda Architecture
- low latency / The need for Lambda Architecture
- extensible / The need for Lambda Architecture
- maintenance / The need for Lambda Architecture
- features, resilient distributed datasets (RDD)
- fault tolerance / Fault tolerance
- storage / Storage
- persistence / Persistence
- shuffling / Shuffling
- features, Spark
- data storage / Apache Spark – a one-stop solution
- use cases / Apache Spark – a one-stop solution
- fault-tolerance / Apache Spark – a one-stop solution
- programming languages / Apache Spark – a one-stop solution
- hardware / Apache Spark – a one-stop solution
- management / Apache Spark – a one-stop solution
- deployment / Apache Spark – a one-stop solution
- efficiency / Apache Spark – a one-stop solution
- distributed caching / Apache Spark – a one-stop solution
- ease of use / Apache Spark – a one-stop solution
- high-level operations / Apache Spark – a one-stop solution
- API and extension / Apache Spark – a one-stop solution
- security / Apache Spark – a one-stop solution
- fence instruction / Memory and cache
- filtering step / Dataset processing
- Flume
- functionalities, RDD API
- partitions / Understanding Spark transformations and actions
- splits / Understanding Spark transformations and actions
- dependencies / Understanding Spark transformations and actions
- partitioner / Understanding Spark transformations and actions
- location of splits / Understanding Spark transformations and actions
- functions, resilient distributed datasets (RDD)
G
- GraphX
H
- Hadoop / The Big Data infrastructure
- reference link / Apache Spark – a one-stop solution
- Hadoop 2.0
- Hadoop 2.4.0 distribution
- URL, for downloading / Programming Spark transformations and actions
- Hadoop ecosystem
- key technologies / The Big Data infrastructure
- HadoopRDD
- HDFS
- high-level architecture, Kinesis
- about / High-level architecture
- high-level architecture, of SQL Streaming Crime Analyzer
- crime producer / The high-level architecture of our job
- stream consumer / The high-level architecture of our job
- Stream to DataFrame transformer / The high-level architecture of our job
- high-level architecture, Spark
- about / High-level architecture
- physical machines / High-level architecture
- data storage layer / High-level architecture
- resource manager / High-level architecture
- Spark core libraries / High-level architecture
- Spark extensions/libraries / High-level architecture
- high-level architecture, Lambda
- data source / high-level architecture
- custom producer / high-level architecture
- real-time layer / high-level architecture
- batch layers / high-level architecture
- serving layers / high-level architecture
- high-level architecture, of Spark Streaming / High-level architecture
- Hive / Components of the Big Data ecosystem
- HiveQL
- reference / Working with Hive tables
- Hive tables
- working with / Working with Hive tables
I
- Infrastructure as a Service (IaaS) / The Big Data ecosystem
- input data streams
- about / The components of Spark Streaming
- basic data sources / The components of Spark Streaming
- advanced data sources / The components of Spark Streaming
- input sources, Storm
- about / Storm input sources, Other sources for input to Storm
- Kafka / Meet Kafka, Kafka as an input source
- file / A file as an input source
- socket / A socket as an input source
- installing
- integration / Dataset processing
- inter-worker communication
- about / Storm internal message processing
- workers, executing on same node / Storm internal message processing
- workers, executing across nodes / Storm internal message processing
- Internet of Things (IoT)
- about / Real-time data processing
- reference link / Real-time data processing
- intra-worker communication
J
- Java
- installing / Java
- Spark job, coding in / Coding a Spark job in Java
- Spark Streaming job, writing in / Writing our Spark Streaming job in Java
- JdbcMapper interface
- JdbcRDD
- Joins
- about / Joins
K
- Kafka
- about / Meet Kafka, Getting to know more about Kafka
- cluster / Meet Kafka
- components / Meet Kafka
- reference / Meet Kafka
- Time to live (TTL) / Getting to know more about Kafka
- topics / Getting to know more about Kafka
- consumers / Getting to know more about Kafka
- offset / Getting to know more about Kafka
- URL / The components of Spark Streaming
- Key Performance Indicators (KPIs)
- about / Batch data processing
- key technologies, Hadoop ecosystem
- about / The Big Data infrastructure
- Hadoop / The Big Data infrastructure
- NoSQL / The Big Data infrastructure
- MPP / The Big Data infrastructure
- Kinesis
- architectural overview / Architectural overview of Kinesis
- URL / The components of Spark Streaming
- Kinesis Client Library (KCL)
- about / Components of Kinesis
- Kinesis Producer Library (KPL)
- about / Components of Kinesis
- retry mechanism / Components of Kinesis
- batching of records / Components of Kinesis
- aggregation / Components of Kinesis
- deaggregation / Components of Kinesis
- monitoring / Components of Kinesis
- Kinesis streaming service
- creating / Creating a Kinesis streaming service
- AWS Kinesis, accessing / Access to AWS Kinesis
- development environment, configuring / Configuring the development environment
- Kinesis streams, creating / Creating Kinesis streams
- Kinesis stream producers, creating / Creating Kinesis stream producers
- Kinesis stream consumers, creating / Creating Kinesis stream consumers
- crime alerts, generating / Generating and consuming crime alerts
- crime alerts, consuming / Generating and consuming crime alerts
- Kinesis stream producers
- sample dataset / Creating Kinesis stream producers
- use case / Creating Kinesis stream producers
- Kryo documentation
- reference / Serialization
- Kryo serialization
- reference / Serialization
- Kryo
L
- Lambda Architecture
- about / What is Lambda Architecture
- need for / The need for Lambda Architecture
- features / The need for Lambda Architecture
- components/layers / Layers/components of Lambda Architecture
- Big Data problem statements / Layers/components of Lambda Architecture
- technology matrix / The technology matrix for Lambda Architecture
- realization / Realization of Lambda Architecture
- least-recently-used (LRU) / Handling persistence in Spark
- LMAX
- about / Understanding LMAX
- memory / Memory and cache
- cache / Memory and cache
- ring buffer / Ring buffer – the heart of the disruptor
- LMAX Disruptor / Storm internal message processing
- log analysis
- reference link / Batch data processing
- Logstash
M
- MapReduce / Components of the Big Data ecosystem
- Massively Parallel Processing (MPP) / The Big Data infrastructure
- membar / Memory and cache
- memory barrier / Memory and cache
- memory fence / Memory and cache
- memory tuning
- about / Memory tuning
- garbage collection / Memory tuning
- object sizes / Memory tuning
- executor memory / Memory tuning
- Mesos
- URL / High-level architecture
- Message Passing Interface (MPI) / Batch processing in distributed mode
- microbatches / High-level architecture
- MLlib
- modes, YARN
- YARN client mode / The Spark execution model – master-worker view
- YARN cluster mode / The Spark execution model – master-worker view
- monitoring
- about / Deployment and monitoring
- MultiLangDaemon interface / Components of Kinesis
N
- near real-time (NRT) systems
- about / Real-time data processing
- Netty
- about / Netty
- NewHadoopRDD
- reference link / RDD APIs
- Nimbus
- about / A Storm cluster
- node manager (NM) / The Spark execution model – master-worker view
- NodeManager (NM) / Executing Spark Streaming applications on Yarn
- NoSQL / The Big Data infrastructure
- NoSQL databases
- advantages / Advantages of NoSQL databases
- choice / Choosing a NoSQL database
- NoSQL databases, distinguishing
- key-value store / Distributed databases (NoSQL)
- column store / Distributed databases (NoSQL)
- wide column store / Distributed databases (NoSQL)
- document database / Distributed databases (NoSQL)
- graph database / Distributed databases (NoSQL)
O
- operations, RDD API
- reference link / RDD APIs
- Oracle Java 7
- download link / Java
- OrderedRDDFunctions
- org.apache.spark.streaming.dstream.DStream.scala / Spark Streaming APIs
- org.apache.spark.streaming.flume.*
- reference link / Spark Streaming APIs
- org.apache.spark.streaming.kafka.*
- reference link / Spark Streaming APIs
- org.apache.spark.streaming.kinesis.*
- reference link / Spark Streaming APIs
- org.apache.spark.streaming.StreamingContext / Spark Streaming APIs
- org.apache.spark.streaming.twitter.*
- reference link / Spark Streaming APIs
- org.apache.spark.streaming.zeromq.*
- reference link / Spark Streaming APIs
- Illinois Uniform Crime Reporting (IUCR) / Programming Spark transformations and actions
- output data streams
- output operations, DStreams
- print() / Spark Streaming operations
- saveAsTextFiles(prefix, suffix) / Spark Streaming operations
- saveAsObjectFiles(prefix, suffix) / Spark Streaming operations
- saveAsHadoopFiles(prefix, suffix) / Spark Streaming operations
- foreachRDD(func) / Spark Streaming operations
P
- packaging structure, Spark Streaming
- about / The packaging structure of Spark Streaming
- Spark Streaming APIs / Spark Streaming APIs
- Spark Streaming operations / Spark Streaming operations
- PairRDDFunctions
- Parquet
- about / Working with Parquet
- working with / Working with Parquet
- URL / Working with Parquet
- data, persisting in HDFS / Persisting Parquet data in HDFS
- partitioner
- partitioning
- about / Partitioning and schema evolution or merging , Partitioning
- reference / Partitioning
- partition keys
- about / Components of Kinesis
- partitions / Partitioning and parallelism
- performance tuning
- about / Performance tuning and best practices
- partitioning / Partitioning and parallelism
- parallelism / Partitioning and parallelism
- serialization / Serialization
- caching / Caching
- memory tuning / Memory tuning
- persistence
- handling, in Spark / Handling persistence in Spark
- phases, Spark SQL
- analysis / The Catalyst optimizer
- logical optimization / The Catalyst optimizer
- physical planning / The Catalyst optimizer
- code generation / The Catalyst optimizer
- Pig / Components of the Big Data ecosystem
- practical use cases, Spark
- batch processing / When to use Spark – practical use cases
- streaming / When to use Spark – practical use cases
- data mining / When to use Spark – practical use cases
- MLlib / When to use Spark – practical use cases
- graph computing / When to use Spark – practical use cases
- GraphX / When to use Spark – practical use cases
- interactive analysis / When to use Spark – practical use cases
- ProtocolBuffer
- reference / Schema evolution/merging
- pub-sub
- about / Getting to know more about Kafka
Q
- quasiquotes
- reference / The Catalyst optimizer
- queue
- about / Getting to know more about Kafka
R
- RabbitMQ
- RandomSentenceSpout
- about / How and when to use Storm
- RDD
- converting, to DataFrames / Converting RDDs to DataFrames
- automated process / Converting RDDs to DataFrames, Automated process
- manual process / Converting RDDs to DataFrames, The manual process
- RDD.scala
- about / RDD APIs
- RDD action operations
- about / RDD action operations
- reduce(func) / RDD action operations
- collect() / RDD action operations
- count() / RDD action operations
- countApproxDistinct(relativeSD: Double = 0.05) / RDD action operations
- countByKey() / RDD action operations
- first() / RDD action operations
- take(n) / RDD action operations
- takeSample(withReplacement, num, [seed]) / RDD action operations
- takeOrdered(num: Int) / RDD action operations
- saveAsTextFile(path: String) / RDD action operations
- saveAsSequenceFile(path: String) / RDD action operations
- saveAsObjectFile(path: String) / RDD action operations
- RDD API
- functionalities / Understanding Spark transformations and actions
- RDD APIs
- RDD transformation operations
- about / RDD transformation operations
- filter(filterFunc) / RDD transformation operations
- map(mapFunc) / RDD transformation operations
- flatMap(flatMapFunc) / RDD transformation operations
- mapPartitions(mapPartFunc, preservePartitioning) / RDD transformation operations
- distinct() / RDD transformation operations
- union(otherDataset) / RDD transformation operations
- intersection(otherDataset) / RDD transformation operations
- groupByKey([numTasks]) / RDD transformation operations
- reduceByKey(func, [numTasks]) / RDD transformation operations
- coalesce(numPartitions) / RDD transformation operations
- sortBy(f, [ascending], [numTasks]) / RDD transformation operations
- sortByKey([ascending], [numTasks]) / RDD transformation operations
- repartition(numPartitions) / RDD transformation operations
- join(otherDataset, [numTasks]) / RDD transformation operations
- real-time (RT) systems / Real-time data processing
- real-time data processing
- about / Real-time data processing
- use cases / Real-time data processing
- challenges / Real-time data processing
- real-time processing
- about / Real-time processing
- telecom or cellular arena / The telecoms or cellular arena
- transportation and logistics / Transportation and logistics
- connected vehicle / The connected vehicle
- financial sector / The financial sector
- realization, of Lambda Architecture
- about / Realization of Lambda Architecture
- high-level architecture / high-level architecture
- Apache Cassandra, configuring / Configuring Apache Cassandra and Spark
- Spark, configuring / Configuring Apache Cassandra and Spark
- custom producer, coding / Coding the custom producer
- real-time layers, coding / Coding the real-time layer
- batch layers, coding / Coding the batch layer
- serving layers, coding / Coding the serving layer
- layers, executing / Executing all the layers
- Redshift
- reference / Components of Kinesis
- reduce functionality
- reference link / RDD action operations
- Relational Database Management Systems (RDBMS) / The emergence of Spark SQL
- relaxed SLAs
- about / Batch data processing
- replication / Distributed databases (NoSQL)
- resilient distributed datasets (RDD)
- about / The architecture of Spark, The Spark execution model – master-worker view, Resilient distributed datasets (RDD)
- features / RDD – by definition
- functions / Storage
- Resilient Distributed Datasets (RDD)
- Resilient Distributed Datasets (RDDs)
- reference link / Shuffling
- about / High-level architecture
- resource manager
- resource manager (RM) / The Spark execution model – master-worker view
- ResourceManager (RM) / Executing Spark Streaming applications on Yarn
- resource managers, Spark
- Apache Mesos / The Spark execution model – master-worker view
- Hadoop YARN / The Spark execution model – master-worker view
- standalone mode / The Spark execution model – master-worker view
- local mode / The Spark execution model – master-worker view
- ring buffer
- about / Ring buffer – the heart of the disruptor
- producers / Producers
- consumers / Consumers
- rule-based optimizations / The Catalyst optimizer
S
- S3
- reference / Components of Kinesis
- Scala
- reference link / Spark packaging structure and core APIs
- installing / Scala
- Spark job, coding in / Coding a Spark job in Scala
- Spark Streaming job, writing in / Writing our Spark Streaming job in Scala
- Scala 2.10.5 compressed tarball
- download link / Scala
- Scala APIs, by Spark Core
- org.apache.spark / Spark packaging structure and core APIs
- org.apache.spark.SparkContext / Spark packaging structure and core APIs
- org.apache.spark.rdd.RDD.scala / Spark packaging structure and core APIs
- org.apache.spark.annotation / Spark packaging structure and core APIs
- org.apache.spark.broadcast / Spark packaging structure and core APIs
- HttpBroadcast / Spark packaging structure and core APIs
- TorrentBroadcast / Spark packaging structure and core APIs
- org.apache.spark.io / Spark packaging structure and core APIs
- org.apache.spark.scheduler / Spark packaging structure and core APIs
- org.apache.spark.storage / Spark packaging structure and core APIs
- org.apache.spark.util / Spark packaging structure and core APIs
- scalability
- reference link / Batch data processing, The need for Lambda Architecture
- schema evolution
- about / Schema evolution/merging
- schema merging
- about / Schema evolution/merging
- SequenceFileRDDFunctions
- serialization process
- shards
- about / Components of Kinesis
- for reads / Components of Kinesis
- for writes / Components of Kinesis
- single point of failure (SPOF) / The need for Lambda Architecture
- SLAs
- about / Batch data processing
- smart traversing
- software development kit (SDK) / Components of Kinesis
- Spark
- overview / An overview of Spark
- about / Apache Spark – a one-stop solution
- features / Apache Spark – a one-stop solution
- practical use cases / When to use Spark – practical use cases
- packaging structure / Spark packaging structure and core APIs
- core APIs / Spark packaging structure and core APIs
- hardware requisites / Hardware requirements
- installing / Spark
- persistence handling / Handling persistence in Spark
- storage levels / Handling persistence in Spark
- Spark-Cassandra connector
- reference link / Configuring Apache Cassandra and Spark
- Spark-Cassandra Java library
- reference link / Configuring Apache Cassandra and Spark
- Spark 1.4.0
- download link / Configuring Apache Cassandra and Spark
- Spark actions
- Spark architecture
- about / The architecture of Spark
- high-level architecture / High-level architecture
- Spark cluster
- configuring / Configuring the Spark cluster
- Spark compressed tarball
- download link / Spark
- Spark Core
- Spark core engine
- Spark driver
- Spark execution model
- Spark extensions
- Spark framework
- error / Working with Parquet
- overwrite / Working with Parquet
- append / Working with Parquet
- ignore / Working with Parquet
- Spark job
- coding, in Scala / Coding a Spark job in Scala
- coding, in Java / Coding a Spark job in Java
- Spark master
- Spark packages
- reference link / Spark extensions/libraries
- SparkR
- about / Spark extensions/libraries
- reference link / Spark extensions/libraries
- Spark SQL
- reference link / Spark extensions/libraries
- phases / The Catalyst optimizer
- SPARK SQL
- architecture / The architecture of Spark SQL
- emergence / The emergence of Spark SQL
- about / The emergence of Spark SQL
- features / The emergence of Spark SQL
- components / The components of Spark SQL
- DataFrame API / The components of Spark SQL
- catalyst optimizer / The components of Spark SQL
- Spark SQL job
- coding / Coding our first Spark SQL job
- reference / Coding our first Spark SQL job
- coding, in Scala / Coding a Spark SQL job in Scala
- coding, in Java / Coding a Spark SQL job in Java
- Spark Steaming job
- coding / Coding our first Spark Streaming job
- Spark Streaming
- reference link / When to use Spark – practical use cases, Spark extensions/libraries
- about / Spark extensions/libraries
- high-level architecture / High-level architecture
- components / The components of Spark Streaming
- packaging structure / The packaging structure of Spark Streaming
- Spark Streaming APIs
- about / Spark Streaming APIs
- reference link / Spark Streaming APIs
- Spark Streaming applications
- executing, on YARN / Executing Spark Streaming applications on Yarn
- executing, on Apache Mesos / Executing Spark Streaming applications on Apache Mesos
- monitoring / Monitoring Spark Streaming applications
- reference link / Monitoring Spark Streaming applications
- Spark Streaming job
- writing, in Scala / Writing our Spark Streaming job in Scala
- writing, in Java / Writing our Spark Streaming job in Java
- executing / Executing our Spark Streaming job
- Spark streaming job
- about / The components of Spark Streaming
- data receiver / The components of Spark Streaming
- batches / The components of Spark Streaming
- DStreams / The components of Spark Streaming
- streaming contexts / The components of Spark Streaming
- Spark Streaming operations
- about / Spark Streaming operations
- Spark transformation
- Spark UI
- workers / Configuring the Spark cluster
- running applications / Configuring the Spark cluster
- completed application / Configuring the Spark cluster
- Spark worker/executors
- speed layers
- splits
- spout collector / The concept of anchoring and reliability
- SQL Streaming Crime Analyzer
- high-level architecture / The high-level architecture of our job
- crime producer, coding / Coding the crime producer
- stream consumer, coding / Coding the stream consumer and transformer
- stream transformer, coding / Coding the stream consumer and transformer
- executing / Executing the SQL Streaming Crime Analyzer
- standalone resource manager
- about / Configuring the Spark cluster
- StorageLevel class
- reference link / Persistence
- storage levels, Spark
- StorageLevel.MEMORY_ONLY / Handling persistence in Spark
- StorageLevel.MEMORY_ONLY_SER / Handling persistence in Spark
- StorageLevel.MEMORY_AND_DISK / Handling persistence in Spark
- StorageLevel.MEMORY_AND_DISK_SER / Handling persistence in Spark
- StorageLevel.DISK_ONLY / Handling persistence in Spark
- StorageLevel.MEMORY_ONLY_2, MEMORY_AND_DISK_2 / Handling persistence in Spark
- StorageLevel.OFF_HEAP / Handling persistence in Spark
- Storm
- about / Real-time processing
- overview / An overview of Storm
- journey / The journey of Storm
- performance / The journey of Storm
- scalability / The journey of Storm
- fail safe / The journey of Storm
- reliability / The journey of Storm
- easy / The journey of Storm
- open source / The journey of Storm
- abstractions / Storm abstractions
- architecture / Storm architecture and its components
- components / Storm architecture and its components
- local mode / Storm architecture and its components
- distributed mode / Storm architecture and its components
- reference / Storm architecture and its components
- using / How and when to use Storm
- input sources / Storm input sources
- performance, optimizing / Optimizing Storm performance
- reference link / Apache Spark – a one-stop solution
- Storm abstractions
- Storm acking framework
- about / The Storm acking framework
- Storm cluster
- about / A Storm cluster
- Nimbus / A Storm cluster
- Supervisors / A Storm cluster
- UI / A Storm cluster
- Storm internal message processing
- about / Storm internal message processing
- inter-worker communication / Storm internal message processing
- intra-worker communication / Storm internal message processing
- Storm internals
- about / Storm internals
- Storm parallelism / Storm parallelism
- Storm internal message processing / Storm internal message processing
- Storm internode communication
- about / Storm internode communication
- ZeroMQ / ZeroMQ
- Netty / Netty
- Storm parallelism
- about / Storm parallelism
- worker process / Storm parallelism
- executors / Storm parallelism
- tasks / Storm parallelism
- Storm persistence
- about / Storm persistence
- JDBC persistence framework / Storm's JDBC persistence framework
- Storm simple patterns
- about / Storm simple patterns
- Joins / Joins
- batching / Batching
- Storm UI
- about / Understanding the Storm UI
- landing page / Storm UI landing page
- topology home page / Topology home page
- StreamingContext
- streaming data
- querying / Querying streaming data in real time
- stream producer
- creating / Creating a stream producer
- Supervisors
- about / A Storm cluster
- workers / A Storm cluster
- executors / A Storm cluster
- tasks / A Storm cluster
T
- Tachyon
- TextInputFormat
- reference link / Understanding Spark transformations and actions
- Thrift
- reference / Schema evolution/merging
- transformation / Dataset processing
- transformation operations, on input streams
- reference link / Spark Streaming operations
- transformation operations, on streaming data
- windowing operations / Spark Streaming operations
- transform operations / Spark Streaming operations
- updateStateByKey operation / Spark Streaming operations
- output operations / Spark Streaming operations
- Trident
- working with / Working with Trident
- transactions / Transactions
- topology / Trident topology
- operations / Trident operations
- Trident operations
- about / Trident operations
- merging / Merging and joining
- joining / Merging and joining
- filter / Filter, Function
- aggregation / Aggregation
- grouping / Grouping
- state maintenance / State maintenance
- Trident topology
- about / Trident topology
- Trident tuples / Trident tuples
- Trident spout / Trident spout
- troubleshooting tips
- about / Troubleshooting – tips and tricks
- port numbers, used by Spark / Port numbers used by Spark
- classpath issues / Classpath issues – class not found exception
- other common exceptions / Other common exceptions
U
- use cases, for batch data processing
- log analysis/analytics / Batch data processing
- predictive maintenance / Batch data processing
- faster claim processing / Batch data processing
- pricing analytics / Batch data processing
- use cases, real-time data processing
- Internet of Things (IoT) / Real-time data processing
- online trading systems / Real-time data processing
- online publishing / Real-time data processing
- assembly lines / Real-time data processing
- online gaming systems / Real-time data processing
W
- WordCountTopology
- about / How and when to use Storm
- Write Ahead Logs (WAL) / The technology matrix for Lambda Architecture
Y
- YARN
- URL / High-level architecture
- modes / The Spark execution model – master-worker view
- Spark Streaming applications, executing on / Executing Spark Streaming applications on Yarn
- reference link / Executing Spark Streaming applications on Yarn
- YARN client mode / The Spark execution model – master-worker view
- YARN cluster mode / The Spark execution model – master-worker view
- Yet Another Resource Negotiator (YARN) / Batch processing in distributed mode
Z
- ZeroMQ
- about / ZeroMQ
- Storm ZeroMQ configurations / Storm ZeroMQ configurations
- ZooKeeper / Optimizing Storm performance
- ZooKeeper
- about / A Storm cluster
- ZooKeeper cluster
- about / A Zookeeper cluster