Index
A
- Abstract Syntax Tree / The Catalyst optimizer
- advanced data sources
- reference link / The components of Spark Streaming
- Amazon Kinesis
- about / Benefits and use cases of Amazon Kinesis
- managed service / Benefits and use cases of Amazon Kinesis
- disruptive innovation / Benefits and use cases of Amazon Kinesis
- benefits / Benefits and use cases of Amazon Kinesis
- telecommunication / Benefits and use cases of Amazon Kinesis
- healthcare / Benefits and use cases of Amazon Kinesis
- automotive / Benefits and use cases of Amazon Kinesis
- Amazon S3
- reference link / Executing Spark Streaming applications on Apache Mesos
- Analytical Engine / Solution implementation
- anchoring / The concept of anchoring and reliability
- annotations, org.apache.spark.annotation
- DeveloperAPI / Spark packaging structure and core APIs
- Experimental / Spark packaging structure and core APIs
- AlphaComponent / Spark packaging structure and core APIs
- Apache Cassandra 2.1.7
- reference link / Configuring Apache Cassandra and Spark
- Apache Flume
- Apache Hadoop
- Apache Kafka
- Apache Mesos
- about / The Spark execution model – master-worker view, Executing Spark Streaming applications on Apache Mesos
- URL / The Spark execution model – master-worker view
- reference link / Executing Spark Streaming applications on Apache Mesos
- Spark Streaming applications, executing on / Executing Spark Streaming applications on Apache Mesos
- Apache Sqoop
- ApplicationMaster (AM) / Executing Spark Streaming applications on Yarn
- application master (AM) / The Spark execution model – master-worker view
- architectural overview, Kinesis
- about / Architectural overview of Kinesis
- Amazon Kinesis, benefits / Benefits and use cases of Amazon Kinesis
- high-level architecture / High-level architecture
- components / Components of Kinesis
- auto learning synchronization mechanism / Solution implementation
- Avro
- reference / Schema evolution/merging
- AWS SDK / Components of Kinesis
- Azure Table Storage (ATS) / Distributed databases (NoSQL)
B
- batch data processing
- about / Batch data processing
- use cases / Batch data processing
- challenges / Batch data processing
- batch duration
- about / High-level architecture
- batching
- batch mode / The emergence of Spark SQL
- batch processing
- in distributed mode
- about / Batch processing in distributed mode
- code, pushing to data / Push code to data
- Big Data
- about / Big Data – a phenomenon
- dimensional paradigm / The Big Data dimensional paradigm
- infrastructure / The Big Data infrastructure
- Big Data analytics architecture
- about / The Big Data analytics architecture
- business solution, building / Building business solutions
- data processing / Dataset processing
- solution implementation / Solution implementation
- presentation / Presentation
- Big Data ecosystem
- about / The Big Data ecosystem
- components / Components of the Big Data ecosystem
- Big Data problem statements, Lambda Architecture
- Volume / Layers/components of Lambda Architecture
- reference link / Layers/components of Lambda Architecture
- Velocity / Layers/components of Lambda Architecture
- Variety / Layers/components of Lambda Architecture
- bolts
- Business Intelligence (BI) / The Big Data ecosystem
C
- Call Data Record (CDR) / The telecoms or cellular arena
- CAS (compare-and-swap) / Producers
- cascading / Components of the Big Data ecosystem
- Cassandra Core driver
- reference link / Configuring Apache Cassandra and Spark
- Cassandra Query Language (CQL) / Configuring Apache Cassandra and Spark
- Catalyst optimizer
- about / The Catalyst optimizer
- phases / The Catalyst optimizer
- challenges, batch data processing
- large data / Batch data processing
- distributed processing / Batch data processing
- SLAs / Batch data processing
- fault tolerant / Batch data processing
- challenges, in selecting technology for data consumption layer
- highly available / The technology matrix for Lambda Architecture
- fault tolerance / The technology matrix for Lambda Architecture
- reliability / The technology matrix for Lambda Architecture
- performance efficient / The technology matrix for Lambda Architecture
- extendable and flexible / The technology matrix for Lambda Architecture
- challenges, real-time data processing
- strict SLAs / Real-time data processing
- recovering from failures / Real-time data processing
- scalable / Real-time data processing
- all in-memory / Real-time data processing
- asynchronous / Real-time data processing
- cluster manager
- cluster managers, for Spark streaming
- Coda Hale metrics library
- reference link / Monitoring Spark Streaming applications
- Complex Event Processing (CEP) / Real-time processing
- components, Big Data ecosystem
- components, Kinesis
- about / Components of Kinesis
- data sources / Components of Kinesis
- producers / Components of Kinesis
- consumers / Components of Kinesis
- AWS SDK / Components of Kinesis
- KPL / Components of Kinesis
- KCL / Components of Kinesis
- Kinesis streams / Components of Kinesis
- shards / Components of Kinesis
- partition keys / Components of Kinesis
- sequence numbers / Components of Kinesis
- components, Spark SQL
- DataFrame API / The DataFrame API
- Catalyst optimizer / The Catalyst optimizer
- SQL/Hive contexts / SQL and Hive contexts
- components, Spark Streaming
- about / The components of Spark Streaming
- input data streams / The components of Spark Streaming
- Spark streaming job / The components of Spark Streaming
- Spark core engine / The components of Spark Streaming
- output data streams / The components of Spark Streaming
- components/layers, Lambda Architecture
- data sources / Layers/components of Lambda Architecture
- data consumption layer / Layers/components of Lambda Architecture
- batch layer / Layers/components of Lambda Architecture
- real-time layers / Layers/components of Lambda Architecture
- serving layers / Layers/components of Lambda Architecture
- ConnectionProvider interface
- consumer group
- about / Getting to know more about Kafka
- cost-based optimization / The Catalyst optimizer
- CQLSH
- custom connectors
- reference link / The components of Spark Streaming
D
- Dashboard/Workbench / Solution implementation
- Data as a Service (DaaS) / The Big Data ecosystem
- DataFrame API
- about / The DataFrame API
- DataFrames and RDD / DataFrames and RDD
- user-defined functions / User-defined functions
- DataFrames and SQL / DataFrames and SQL
- DataFrames
- about / Spark extensions/libraries
- Data Lineage
- data mining
- about / When to use Spark – practical use cases
- reference link / When to use Spark – practical use cases
- data processing
- reliability / Reliability of data processing
- anchoring / The concept of anchoring and reliability
- Storm acking framework / The Storm acking framework
- dependencies
- deployment
- about / Deployment and monitoring
- dimensional paradigm, Big Data
- about / The Big Data dimensional paradigm
- volume / The Big Data dimensional paradigm
- velocity / The Big Data dimensional paradigm
- variety / The Big Data dimensional paradigm
- veracity / The Big Data dimensional paradigm
- value / The Big Data dimensional paradigm
- Directed Acyclic Graph (DAG) / Partitioning and parallelism
- directed acyclic graph (DAG)
- about / Spark packaging structure and core APIs
- reference link / Spark packaging structure and core APIs
- distributed batch processing
- about / Distributed batch processing
- distributed computing
- reference link / The technology matrix for Lambda Architecture
- distributed databases (NoSQL)
- about / Distributed databases (NoSQL)
- DoubleRDDFunctions.scala
- DStreams
- duplication / Distributed databases (NoSQL)
- DynamoDB
- reference / Components of Kinesis
E
- Eclipse
- installing / Eclipse
- Eclipse Luna (4.4)
- download link / Eclipse
- electronic publishing
- reference link / Real-time data processing
- electronic trading platform
- reference link / Real-time data processing
- ETL (Extract Transform Load) / Dataset processing
- extensibility
- reference link / The need for Lambda Architecture
- extensions/libraries, Spark
- Spark Streaming / Spark extensions/libraries
- MLlib / Spark extensions/libraries
- GraphX / Spark extensions/libraries
- Spark SQL / Spark extensions/libraries
- SparkR / Spark extensions/libraries
F
- fastutil library
- URL / Memory tuning
- fault tolerance
- reference link / The need for Lambda Architecture
- fault tolerant
- reference link / Batch data processing
- features, Lambda Architecture
- scalable / The need for Lambda Architecture
- resilient to failures / The need for Lambda Architecture
- low latency / The need for Lambda Architecture
- extensible / The need for Lambda Architecture
- maintenance / The need for Lambda Architecture
- features, resilient distributed datasets (RDD)
- fault tolerance / Fault tolerance
- storage / Storage
- persistence / Persistence
- shuffling / Shuffling
- features, Spark
- data storage / Apache Spark – a one-stop solution
- use cases / Apache Spark – a one-stop solution
- fault-tolerance / Apache Spark – a one-stop solution
- programming languages / Apache Spark – a one-stop solution
- hardware / Apache Spark – a one-stop solution
- management / Apache Spark – a one-stop solution
- deployment / Apache Spark – a one-stop solution
- efficiency / Apache Spark – a one-stop solution
- distributed caching / Apache Spark – a one-stop solution
- ease of use / Apache Spark – a one-stop solution
- high-level operations / Apache Spark – a one-stop solution
- API and extension / Apache Spark – a one-stop solution
- security / Apache Spark – a one-stop solution
- fence instruction / Memory and cache
- filtering step / Dataset processing
- Flume
- functionalities, RDD API
- partitions / Understanding Spark transformations and actions
- splits / Understanding Spark transformations and actions
- dependencies / Understanding Spark transformations and actions
- partitioner / Understanding Spark transformations and actions
- location of splits / Understanding Spark transformations and actions
- functions, resilient distributed datasets (RDD)
G
- GraphX
H
- Hadoop / The Big Data infrastructure
- reference link / Apache Spark – a one-stop solution
- Hadoop 2.0
- Hadoop 2.4.0 distribution
- URL, for downloading / Programming Spark transformations and actions
- Hadoop ecosystem
- key technologies / The Big Data infrastructure
- HadoopRDD
- HDFS
- high-level architecture, Kinesis
- about / High-level architecture
- high-level architecture, of SQL Streaming Crime Analyzer
- crime producer / The high-level architecture of our job
- stream consumer / The high-level architecture of our job
- Stream to DataFrame transformer / The high-level architecture of our job
- high-level architecture, Spark
- about / High-level architecture
- physical machines / High-level architecture
- data storage layer / High-level architecture
- resource manager / High-level architecture
- Spark core libraries / High-level architecture
- Spark extensions/libraries / High-level architecture
- high-level architecture, Lambda
- data source / high-level architecture
- custom producer / high-level architecture
- real-time layer / high-level architecture
- batch layers / high-level architecture
- serving layers / high-level architecture
- high-level architecture, of Spark Streaming / High-level architecture
- Hive / Components of the Big Data ecosystem
- HiveQL
- reference / Working with Hive tables
- Hive tables
- working with / Working with Hive tables
I
- Infrastructure as a Service (IaaS) / The Big Data ecosystem
- input data streams
- about / The components of Spark Streaming
- basic data sources / The components of Spark Streaming
- advanced data sources / The components of Spark Streaming
- input sources, Storm
- about / Storm input sources, Other sources for input to Storm
- Kafka / Meet Kafka, Kafka as an input source
- file / A file as an input source
- socket / A socket as an input source
- installing
- integration / Dataset processing
- inter-worker communication
- about / Storm internal message processing
- workers, executing on same node / Storm internal message processing
- workers, executing across nodes / Storm internal message processing
- Internet of Things (IoT)
- about / Real-time data processing
- reference link / Real-time data processing
- intra-worker communication
J
- Java
- installing / Java
- Spark job, coding in / Coding a Spark job in Java
- Spark Streaming job, writing in / Writing our Spark Streaming job in Java
- JdbcMapper interface
- JdbcRDD
- Joins
- about / Joins
K
- Kafka
- about / Meet Kafka, Getting to know more about Kafka
- cluster / Meet Kafka
- components / Meet Kafka
- reference / Meet Kafka
- Time to live (TTL) / Getting to know more about Kafka
- topics / Getting to know more about Kafka
- consumers / Getting to know more about Kafka
- offset / Getting to know more about Kafka
- URL / The components of Spark Streaming
- Key Performance Indicators (KPIs)
- about / Batch data processing
- key technologies, Hadoop ecosystem
- about / The Big Data infrastructure
- Hadoop / The Big Data infrastructure
- NoSQL / The Big Data infrastructure
- MPP / The Big Data infrastructure
- Kinesis
- architectural overview / Architectural overview of Kinesis
- URL / The components of Spark Streaming
- Kinesis Client Library (KCL)
- about / Components of Kinesis
- Kinesis Producer Library (KPL)
- about / Components of Kinesis
- retry mechanism / Components of Kinesis
- batching of records / Components of Kinesis
- aggregation / Components of Kinesis
- deaggregation / Components of Kinesis
- monitoring / Components of Kinesis
- Kinesis streaming service
- creating / Creating a Kinesis streaming service
- AWS Kinesis, accessing / Access to AWS Kinesis
- development environment, configuring / Configuring the development environment
- Kinesis streams, creating / Creating Kinesis streams
- Kinesis stream producers, creating / Creating Kinesis stream producers
- Kinesis stream consumers, creating / Creating Kinesis stream consumers
- crime alerts, generating / Generating and consuming crime alerts
- crime alerts, consuming / Generating and consuming crime alerts
- Kinesis stream producers
- sample dataset / Creating Kinesis stream producers
- use case / Creating Kinesis stream producers
- Kryo documentation
- reference / Serialization
- Kryo serialization
- reference / Serialization
- Kryo
L
- Lambda Architecture
- about / What is Lambda Architecture
- need for / The need for Lambda Architecture
- features / The need for Lambda Architecture
- components/layers / Layers/components of Lambda Architecture
- Big Data problem statements / Layers/components of Lambda Architecture
- technology matrix / The technology matrix for Lambda Architecture
- realization / Realization of Lambda Architecture
- least-recently-used (LRU) / Handling persistence in Spark
- LMAX
- about / Understanding LMAX
- memory / Memory and cache
- cache / Memory and cache
- ring buffer / Ring buffer – the heart of the disruptor
- LMAX Disruptor / Storm internal message processing
- log analysis
- reference link / Batch data processing
- Logstash
M
- MapReduce / Components of the Big Data ecosystem
- Massively Parallel Processing (MPP) / The Big Data infrastructure
- membar / Memory and cache
- memory barrier / Memory and cache
- memory fence / Memory and cache
- memory tuning
- about / Memory tuning
- garbage collection / Memory tuning
- object sizes / Memory tuning
- executor memory / Memory tuning
- Mesos
- URL / High-level architecture
- Message Passing Interface (MPI) / Batch processing in distributed mode
- microbatches / High-level architecture
- MLlib
- modes, YARN
- YARN client mode / The Spark execution model – master-worker view
- YARN cluster mode / The Spark execution model – master-worker view
- monitoring
- about / Deployment and monitoring
- MultiLangDaemon interface / Components of Kinesis
N
- near real-time (NRT) systems
- about / Real-time data processing
- Netty
- about / Netty
- NewHadoopRDD
- reference link / RDD APIs
- Nimbus
- about / A Storm cluster
- node manager (NM) / The Spark execution model – master-worker view
- NodeManager (NM) / Executing Spark Streaming applications on Yarn
- NoSQL / The Big Data infrastructure
- NoSQL databases
- advantages / Advantages of NoSQL databases
- choice / Choosing a NoSQL database
- NoSQL databases, distinguishing
- key-value store / Distributed databases (NoSQL)
- column store / Distributed databases (NoSQL)
- wide column store / Distributed databases (NoSQL)
- document database / Distributed databases (NoSQL)
- graph database / Distributed databases (NoSQL)
O
- operations, RDD API
- reference link / RDD APIs
- Oracle Java 7
- download link / Java
- OrderedRDDFunctions
- org.apache.spark.streaming.dstream.DStream.scala / Spark Streaming APIs
- org.apache.spark.streaming.flume.*
- reference link / Spark Streaming APIs
- org.apache.spark.streaming.kafka.*
- reference link / Spark Streaming APIs
- org.apache.spark.streaming.kinesis.*
- reference link / Spark Streaming APIs
- org.apache.spark.streaming.StreamingContext / Spark Streaming APIs
- org.apache.spark.streaming.twitter.*
- reference link / Spark Streaming APIs
- org.apache.spark.streaming.zeromq.*
- reference link / Spark Streaming APIs
- Illinois Uniform Crime Reporting (IUCR) / Programming Spark transformations and actions
- output data streams
- output operations, DStreams
- print() / Spark Streaming operations
- saveAsTextFiles(prefix, suffix) / Spark Streaming operations
- saveAsObjectFiles(prefix, suffix) / Spark Streaming operations
- saveAsHadoopFiles(prefix, suffix) / Spark Streaming operations
- foreachRDD(func) / Spark Streaming operations
P
- packaging structure, Spark Streaming
- about / The packaging structure of Spark Streaming
- Spark Streaming APIs / Spark Streaming APIs
- Spark Streaming operations / Spark Streaming operations
- PairRDDFunctions
- Parquet
- about / Working with Parquet
- working with / Working with Parquet
- URL / Working with Parquet
- data, persisting in HDFS / Persisting Parquet data in HDFS
- partitioner
- partitioning
- about / Partitioning and schema evolution or merging , Partitioning
- reference / Partitioning
- partition keys
- about / Components of Kinesis
- partitions / Partitioning and parallelism
- performance tuning
- about / Performance tuning and best practices
- partitioning / Partitioning and parallelism
- parallelism / Partitioning and parallelism
- serialization / Serialization
- caching / Caching
- memory tuning / Memory tuning
- persistence
- handling, in Spark / Handling persistence in Spark
- phases, Spark SQL
- analysis / The Catalyst optimizer
- logical optimization / The Catalyst optimizer
- physical planning / The Catalyst optimizer
- code generation / The Catalyst optimizer
- Pig / Components of the Big Data ecosystem
- practical use cases, Spark
- batch processing / When to use Spark – practical use cases
- streaming / When to use Spark – practical use cases
- data mining / When to use Spark – practical use cases
- MLlib / When to use Spark – practical use cases
- graph computing / When to use Spark – practical use cases
- GraphX / When to use Spark – practical use cases
- interactive analysis / When to use Spark – practical use cases
- ProtocolBuffer
- reference / Schema evolution/merging
- pub-sub
- about / Getting to know more about Kafka
Q
- quasiquotes
- reference / The Catalyst optimizer
- queue
- about / Getting to know more about Kafka
R
- RabbitMQ
- RandomSentenceSpout
- about / How and when to use Storm
- RDD
- converting, to DataFrames / Converting RDDs to DataFrames
- automated process / Converting RDDs to DataFrames, Automated process
- manual process / Converting RDDs to DataFrames, The manual process
- RDD.scala
- about / RDD APIs
- RDD action operations
- about / RDD action operations
- reduce(func) / RDD action operations
- collect() / RDD action operations
- count() / RDD action operations
- countApproxDistinct(relativeSD: Double = 0.05) / RDD action operations
- countByKey() / RDD action operations
- first() / RDD action operations
- take(n) / RDD action operations
- takeSample(withReplacement, num, [seed]) / RDD action operations
- takeOrdered(num: Int) / RDD action operations
- saveAsTextFile(path: String) / RDD action operations
- saveAsSequenceFile(path: String) / RDD action operations
- saveAsObjectFile(path: String) / RDD action operations
- RDD API
- functionalities / Understanding Spark transformations and actions
- RDD APIs
- RDD transformation operations
- about / RDD transformation operations
- filter(filterFunc) / RDD transformation operations
- map(mapFunc) / RDD transformation operations
- flatMap(flatMapFunc) / RDD transformation operations
- mapPartitions(mapPartFunc, preservePartitioning) / RDD transformation operations
- distinct() / RDD transformation operations
- union(otherDataset) / RDD transformation operations
- intersection(otherDataset) / RDD transformation operations
- groupByKey([numTasks]) / RDD transformation operations
- reduceByKey(func, [numTasks]) / RDD transformation operations
- coalesce(numPartitions) / RDD transformation operations
- sortBy(f, [ascending], [numTasks]) / RDD transformation operations
- sortByKey([ascending], [numTasks]) / RDD transformation operations
- repartition(numPartitions) / RDD transformation operations
- join(otherDataset, [numTasks]) / RDD transformation operations
- real-time (RT) systems / Real-time data processing
- real-time data processing
- about / Real-time data processing
- use cases / Real-time data processing
- challenges / Real-time data processing
- real-time processing
- about / Real-time processing
- telecom or cellular arena / The telecoms or cellular arena
- transportation and logistics / Transportation and logistics
- connected vehicle / The connected vehicle
- financial sector / The financial sector
- realization, of Lambda Architecture
- about / Realization of Lambda Architecture
- high-level architecture / high-level architecture
- Apache Cassandra, configuring / Configuring Apache Cassandra and Spark
- Spark, configuring / Configuring Apache Cassandra and Spark
- custom producer, coding / Coding the custom producer
- real-time layers, coding / Coding the real-time layer
- batch layers, coding / Coding the batch layer
- serving layers, coding / Coding the serving layer
- layers, executing / Executing all the layers
- Redshift
- reference / Components of Kinesis
- reduce functionality
- reference link / RDD action operations
- Relational Database Management Systems (RDBMS) / The emergence of Spark SQL
- relaxed SLAs
- about / Batch data processing
- replication / Distributed databases (NoSQL)
- resilient distributed datasets (RDD)
- about / The architecture of Spark, The Spark execution model – master-worker view, Resilient distributed datasets (RDD)
- features / RDD – by definition
- functions / Storage
- Resilient Distributed Datasets (RDD)
- Resilient Distributed Datasets (RDDs)
- reference link / Shuffling
- about / High-level architecture
- resource manager
- resource manager (RM) / The Spark execution model – master-worker view
- ResourceManager (RM) / Executing Spark Streaming applications on Yarn
- resource managers, Spark
- Apache Mesos / The Spark execution model – master-worker view
- Hadoop YARN / The Spark execution model – master-worker view
- standalone mode / The Spark execution model – master-worker view
- local mode / The Spark execution model – master-worker view
- ring buffer
- about / Ring buffer – the heart of the disruptor
- producers / Producers
- consumers / Consumers
- rule-based optimizations / The Catalyst optimizer
S
- S3
- reference / Components of Kinesis
- Scala
- reference link / Spark packaging structure and core APIs
- installing / Scala
- Spark job, coding in / Coding a Spark job in Scala
- Spark Streaming job, writing in / Writing our Spark Streaming job in Scala
- Scala 2.10.5 compressed tarball
- download link / Scala
- Scala APIs, by Spark Core
- org.apache.spark / Spark packaging structure and core APIs
- org.apache.spark.SparkContext / Spark packaging structure and core APIs
- org.apache.spark.rdd.RDD.scala / Spark packaging structure and core APIs
- org.apache.spark.annotation / Spark packaging structure and core APIs
- org.apache.spark.broadcast / Spark packaging structure and core APIs
- HttpBroadcast / Spark packaging structure and core APIs
- TorrentBroadcast / Spark packaging structure and core APIs
- org.apache.spark.io / Spark packaging structure and core APIs
- org.apache.spark.scheduler / Spark packaging structure and core APIs
- org.apache.spark.storage / Spark packaging structure and core APIs
- org.apache.spark.util / Spark packaging structure and core APIs
- scalability
- reference link / Batch data processing, The need for Lambda Architecture
- schema evolution
- about / Schema evolution/merging
- schema merging
- about / Schema evolution/merging
- SequenceFileRDDFunctions
- serialization process
- shards
- about / Components of Kinesis
- for reads / Components of Kinesis
- for writes / Components of Kinesis
- single point of failure (SPOF) / The need for Lambda Architecture
- SLAs
- about / Batch data processing
- smart traversing
- software development kit (SDK) / Components of Kinesis
- Spark
- overview / An overview of Spark
- about / Apache Spark – a one-stop solution
- features / Apache Spark – a one-stop solution
- practical use cases / When to use Spark – practical use cases
- packaging structure / Spark packaging structure and core APIs
- core APIs / Spark packaging structure and core APIs
- hardware requisites / Hardware requirements
- installing / Spark
- persistence handling / Handling persistence in Spark
- storage levels / Handling persistence in Spark
- Spark-Cassandra connector
- reference link / Configuring Apache Cassandra and Spark
- Spark-Cassandra Java library
- reference link / Configuring Apache Cassandra and Spark
- Spark 1.4.0
- download link / Configuring Apache Cassandra and Spark
- Spark actions
- Spark architecture
- about / The architecture of Spark
- high-level architecture / High-level architecture
- Spark cluster
- configuring / Configuring the Spark cluster
- Spark compressed tarball
- download link / Spark
- Spark Core
- Spark core engine
- Spark driver
- Spark execution model
- Spark extensions
- Spark framework
- error / Working with Parquet
- overwrite / Working with Parquet
- append / Working with Parquet
- ignore / Working with Parquet
- Spark job
- coding, in Scala / Coding a Spark job in Scala
- coding, in Java / Coding a Spark job in Java
- Spark master
- Spark packages
- reference link / Spark extensions/libraries
- SparkR
- about / Spark extensions/libraries
- reference link / Spark extensions/libraries
- Spark SQL
- reference link / Spark extensions/libraries
- phases / The Catalyst optimizer
- SPARK SQL
- architecture / The architecture of Spark SQL
- emergence / The emergence of Spark SQL
- about / The emergence of Spark SQL
- features / The emergence of Spark SQL
- components / The components of Spark SQL
- DataFrame API / The components of Spark SQL
- catalyst optimizer / The components of Spark SQL
- Spark SQL job
- coding / Coding our first Spark SQL job
- reference / Coding our first Spark SQL job
- coding, in Scala / Coding a Spark SQL job in Scala
- coding, in Java / Coding a Spark SQL job in Java
- Spark Steaming job
- coding / Coding our first Spark Streaming job
- Spark Streaming
- reference link / When to use Spark – practical use cases, Spark extensions/libraries
- about / Spark extensions/libraries
- high-level architecture / High-level architecture
- components / The components of Spark Streaming
- packaging structure / The packaging structure of Spark Streaming
- Spark Streaming APIs
- about / Spark Streaming APIs
- reference link / Spark Streaming APIs
- Spark Streaming applications
- executing, on YARN / Executing Spark Streaming applications on Yarn
- executing, on Apache Mesos / Executing Spark Streaming applications on Apache Mesos
- monitoring / Monitoring Spark Streaming applications
- reference link / Monitoring Spark Streaming applications
- Spark Streaming job
- writing, in Scala / Writing our Spark Streaming job in Scala
- writing, in Java / Writing our Spark Streaming job in Java
- executing / Executing our Spark Streaming job
- Spark streaming job
- about / The components of Spark Streaming
- data receiver / The components of Spark Streaming
- batches / The components of Spark Streaming
- DStreams / The components of Spark Streaming
- streaming contexts / The components of Spark Streaming
- Spark Streaming operations
- about / Spark Streaming operations
- Spark transformation
- Spark UI
- workers / Configuring the Spark cluster
- running applications / Configuring the Spark cluster
- completed application / Configuring the Spark cluster
- Spark worker/executors
- speed layers
- splits
- spout collector / The concept of anchoring and reliability
- SQL Streaming Crime Analyzer
- high-level architecture / The high-level architecture of our job
- crime producer, coding / Coding the crime producer
- stream consumer, coding / Coding the stream consumer and transformer
- stream transformer, coding / Coding the stream consumer and transformer
- executing / Executing the SQL Streaming Crime Analyzer
- standalone resource manager
- about / Configuring the Spark cluster
- StorageLevel class
- reference link / Persistence
- storage levels, Spark
- StorageLevel.MEMORY_ONLY / Handling persistence in Spark
- StorageLevel.MEMORY_ONLY_SER / Handling persistence in Spark
- StorageLevel.MEMORY_AND_DISK / Handling persistence in Spark
- StorageLevel.MEMORY_AND_DISK_SER / Handling persistence in Spark
- StorageLevel.DISK_ONLY / Handling persistence in Spark
- StorageLevel.MEMORY_ONLY_2, MEMORY_AND_DISK_2 / Handling persistence in Spark
- StorageLevel.OFF_HEAP / Handling persistence in Spark
- Storm
- about / Real-time processing
- overview / An overview of Storm
- journey / The journey of Storm
- performance / The journey of Storm
- scalability / The journey of Storm
- fail safe / The journey of Storm
- reliability / The journey of Storm
- easy / The journey of Storm
- open source / The journey of Storm
- abstractions / Storm abstractions
- architecture / Storm architecture and its components
- components / Storm architecture and its components
- local mode / Storm architecture and its components
- distributed mode / Storm architecture and its components
- reference / Storm architecture and its components
- using / How and when to use Storm
- input sources / Storm input sources
- performance, optimizing / Optimizing Storm performance
- reference link / Apache Spark – a one-stop solution
- Storm abstractions
- Storm acking framework
- about / The Storm acking framework
- Storm cluster
- about / A Storm cluster
- Nimbus / A Storm cluster
- Supervisors / A Storm cluster
- UI / A Storm cluster
- Storm internal message processing
- about / Storm internal message processing
- inter-worker communication / Storm internal message processing
- intra-worker communication / Storm internal message processing
- Storm internals
- about / Storm internals
- Storm parallelism / Storm parallelism
- Storm internal message processing / Storm internal message processing
- Storm internode communication
- about / Storm internode communication
- ZeroMQ / ZeroMQ
- Netty / Netty
- Storm parallelism
- about / Storm parallelism
- worker process / Storm parallelism
- executors / Storm parallelism
- tasks / Storm parallelism
- Storm persistence
- about / Storm persistence
- JDBC persistence framework / Storm's JDBC persistence framework
- Storm simple patterns
- about / Storm simple patterns
- Joins / Joins
- batching / Batching
- Storm UI
- about / Understanding the Storm UI
- landing page / Storm UI landing page
- topology home page / Topology home page
- StreamingContext
- streaming data
- querying / Querying streaming data in real time
- stream producer
- creating / Creating a stream producer
- Supervisors
- about / A Storm cluster
- workers / A Storm cluster
- executors / A Storm cluster
- tasks / A Storm cluster
T
- Tachyon
- TextInputFormat
- reference link / Understanding Spark transformations and actions
- Thrift
- reference / Schema evolution/merging
- transformation / Dataset processing
- transformation operations, on input streams
- reference link / Spark Streaming operations
- transformation operations, on streaming data
- windowing operations / Spark Streaming operations
- transform operations / Spark Streaming operations
- updateStateByKey operation / Spark Streaming operations
- output operations / Spark Streaming operations
- Trident
- working with / Working with Trident
- transactions / Transactions
- topology / Trident topology
- operations / Trident operations
- Trident operations
- about / Trident operations
- merging / Merging and joining
- joining / Merging and joining
- filter / Filter, Function
- aggregation / Aggregation
- grouping / Grouping
- state maintenance / State maintenance
- Trident topology
- about / Trident topology
- Trident tuples / Trident tuples
- Trident spout / Trident spout
- troubleshooting tips
- about / Troubleshooting – tips and tricks
- port numbers, used by Spark / Port numbers used by Spark
- classpath issues / Classpath issues – class not found exception
- other common exceptions / Other common exceptions
U
- use cases, for batch data processing
- log analysis/analytics / Batch data processing
- predictive maintenance / Batch data processing
- faster claim processing / Batch data processing
- pricing analytics / Batch data processing
- use cases, real-time data processing
- Internet of Things (IoT) / Real-time data processing
- online trading systems / Real-time data processing
- online publishing / Real-time data processing
- assembly lines / Real-time data processing
- online gaming systems / Real-time data processing
W
- WordCountTopology
- about / How and when to use Storm
- Write Ahead Logs (WAL) / The technology matrix for Lambda Architecture
Y
- YARN
- URL / High-level architecture
- modes / The Spark execution model – master-worker view
- Spark Streaming applications, executing on / Executing Spark Streaming applications on Yarn
- reference link / Executing Spark Streaming applications on Yarn
- YARN client mode / The Spark execution model – master-worker view
- YARN cluster mode / The Spark execution model – master-worker view
- Yet Another Resource Negotiator (YARN) / Batch processing in distributed mode
Z
- ZeroMQ
- about / ZeroMQ
- Storm ZeroMQ configurations / Storm ZeroMQ configurations
- ZooKeeper / Optimizing Storm performance
- ZooKeeper
- about / A Storm cluster
- ZooKeeper cluster
- about / A Zookeeper cluster