Index
A
- access control list (ACL) / Spark cluster managers
- accumulators / Accumulators and broadcast variables
- Active Directory (AD) / Hadoop security pillars
- active IDS / Intrusion detection system
- Airflow
- for orchestration / Airflow for orchestration
- reference / Airflow for orchestration
- about / Airflow
- components / Airflow components
- Amazon's Elastic MapReduce / Cloud distributions
- Amazon Kinesis / Logical view of Hadoop in the cloud
- Amazon Kinesis Firehose / Logical view of Hadoop in the cloud
- anomaly-based intrusion detection system / Intrusion detection system
- Apache Ambari / Apache Ambari
- Apache Flume
- best practices / Best practices
- Apache Flume architecture / Apache Flume architecture
- Apache Kafka
- installing / Installing and running Apache Kafka
- running / Installing and running Apache Kafka
- local mode installation / Local mode installation
- distributed mode installation / Distributed mode
- best practices / Best practices
- Apache Kafka architecture / Apache Kafka architecture
- Apache Mesos / Spark cluster managers
- Apache Pig architecture / Apache Pig architecture
- Apache YARN / Spark cluster managers
- availability zones / Regions and availability zone
- Avro
- AWS
- practical example / Practical example using AWS
- AWS Data Pipeline
- about / Amazon Data Pipeline
- functionalities / Amazon Data Pipeline
- AWS Snowball / Logical view of Hadoop in the cloud
B
- balancer
- about / Data rebalancing
- properties / Data rebalancing
- best practices / Best practices for using balancer
- batch ingestion
- about / Batch ingestion
- considerations / Batch ingestion
- batch processing
- about / Batch processing
- considerations / Batch processing
- batch processing pattern
- about / Common batch processing pattern
- Slowly Changing Dimension (SCD) / Slowly changing dimension
- duplicate record / Duplicate record and small files
- small files / Duplicate record and small files
- real-time lookup / Real-time lookup
- best practices, Flink
- parameter tool, using / Best practices
- TupleX types, avoiding / Best practices
- monitoring / Best practices
- best practices, Hadoop ecosystem
- execution engine / Best practices
- managed table, avoiding / Best practices
- file formats, selecting / Best practices
- partitioning / Best practices
- normalization / Best practices
- best practices, Impala
- file format / Best practices
- stats computation / Best practices
- partitioning / Best practices
- small files, avoiding / Best practices
- table structure / Best practices
- coordinator name / Best practices
- best practices, Spark
- large datasets, avoiding / Best practices
- GroupByKey, avoiding / Best practices
- broadcast variable, using / Best practices
- memory tuning / Best practices
- parallelism / Best practices
- RDD caching / Best practices
- best practices, Storm
- failure handling / Best practices
- checkpoint, using / Best practices
- throughput and latency, managing / Best practices
- logging / Best practices
- blocks / Defining HDFS, Blocks
- broadcast variable / Accumulators and broadcast variables
- bucketing / Partitioning and bucketing, Bucketing
- BZip2
- functions / BZip2
C
- capacity scheduler
- about / Capacity scheduler
- configuring / Configuring capacity scheduler
- CAP theorem
- about / CAP theorem
- availability / CAP theorem
- consistency / CAP theorem
- partition tolerance / CAP theorem
- channels, Apache Flume
- memory channel / Memory channel
- file channel / File channel
- Kafka channel / Kafka channel
- checkpoint
- about / Checkpoint using a secondary NameNode
- with Secondary NameNode / Checkpoint using a secondary NameNode
- checkpointing / HDFS logical architecture
- checksum / Data integrity
- checksum verification
- HDFS writes / Data integrity
- HDFS reads / Data integrity
- cloud
- logical view, of Hadoop / Logical view of Hadoop in the cloud
- cloud distributions
- about / Cloud distributions
- Amazon's Elastic MapReduce / Cloud distributions
- Microsoft Azure / Cloud distributions
- Cloudera / On-premise distribution
- Cloud Pub/Sub / Logical view of Hadoop in the cloud
- Cloud storage high availability
- about / Cloud storage high availability
- Amazon S3 outage case history / Amazon S3 outage case history
- CloudWatch
- using, in resource monitoring / Cloud-watch
- coder-decoder (codec) / Types of data compression in Hadoop
- cold data / Erasure encoding in Hadoop 3.x
- column-level filtering / Column-level filtering
- components, Airflow
- web server / Airflow components
- scheduler / Airflow components
- worker / Airflow components
- components, Apache Flume
- channel / Deep dive into source, channel, and sink
- sources / Sources
- components, Hadoop
- HDFS I/O / Introduction to benchmarking and profiling
- NameNodes / Introduction to benchmarking and profiling
- YARN scheduler / Introduction to benchmarking and profiling
- MapReduce / Introduction to benchmarking and profiling
- Hive / Introduction to benchmarking and profiling
- Pig / Introduction to benchmarking and profiling
- components, HBase
- HMaster / HBase architecture and its concept
- region servers / HBase architecture and its concept
- regions / HBase architecture and its concept
- ZooKeeper / HBase architecture and its concept
- components, Heron
- Topology Master / Heron architecture
- Containers / Heron architecture
- Stream Manager / Heron architecture
- Heron Instance / Heron architecture
- Metric Manager / Heron architecture
- components, management group
- NameNodes / HDFS logical architecture
- DataNodes / HDFS logical architecture
- JournalNode / HDFS logical architecture
- ZooKeeper failover controllers / HDFS logical architecture
- components, Node manager core
- resource manager component / Node manager core
- container component / Node manager core
- components, Resource Manager
- client component / Resource Manager component
- core component / Resource Manager component
- Node Manager component / Resource Manager component
- application master component / Resource Manager component
- components, Resource Manager high availability
- Resource Manager state store / Architecture of RM high availability
- Resource Manager restart and failover / Architecture of RM high availability
- failover fencing / Architecture of RM high availability
- leader elector / Architecture of RM high availability
- components, YARN
- Resource Manager / Resource Manager component
- Node manager core / Node manager core
- composite join
- about / Composite join
- used, for sorting input data / Sorting and partitioning
- used, for partitioning input data / Sorting and partitioning
- compression format
- considerations / Compression format consideration
- considerations, Resource Manager high availability
- Resource Manager state / Resource Manager high availability
- running application state / Resource Manager high availability
- automatic failover / Resource Manager high availability
- containers, Node manager core
- application master request / Node manager core
- ContainerLauncher / Node manager core
- ContainerMonitor / Node manager core
- LogHandler / Node manager core
- create, read, update, and delete (CRUD) / HBase operations and its examples
D
- data
- serializing / Serializing your data
- in transit encryption / Data in transit encryption
- at rest encryption / Data at rest encryption
- data availability / Data availability, integrity, and security
- data classification
- features / Data classification
- data compression
- about / Data compression
- benefits / Data compression
- data compression, Hadoop
- about / Types of data compression in Hadoop
- Gzip / Gzip
- BZip2 / BZip2
- Lempel-Ziv-Oberhumer (LZO) / Lempel-Ziv-Oberhumer
- Snappy / Snappy
- data formats / Data formats
- data governance
- about / Data governance
- pillars / Data governance pillars
- metadata management / Metadata management
- data life cycle management (DLM) / Data life cycle management
- data classification / Data classification
- data group
- about / HDFS logical architecture, Concepts of the data group
- blocks / Blocks
- replication / Replication
- data ingestion
- about / Data ingestion
- batch ingestion / Batch ingestion
- macro batch ingestion / Macro batch ingestion
- real-time ingestion / Real-time ingestion
- data integrity / Data integrity, Data availability, integrity, and security
- data life cycle management (DLM) / Data life cycle management
- data lookups / Data lookups
- data management
- about / Data management
- metadata management / Metadata management
- data integrity / Data integrity
- HDFS Snapshots / HDFS Snapshots
- data rebalancing / Data rebalancing
- DataNode diskbalancer tool
- capabilities / Managing disk-skewed data in Hadoop 3.x
- Data Node Protocol (DNP) / HDFS communication architecture
- DataNodes
- about / HDFS logical architecture
- internals / DataNode internals
- heartbeat / DataNode internals
- read/write / DataNode internals
- replication / DataNode internals
- block report / DataNode internals
- metrics / DataNode metrics
- data pipelines / Data pipelines
- data pipelines, tools
- AWS Data Pipeline / Amazon Data Pipeline
- Airflow / Airflow
- data processing
- about / Data processing
- batch processing / Batch processing
- micro batch processing / Micro batch processing
- real-time processing / Real-time processing
- data rebalancing / Data rebalancing
- data security / Data availability, integrity, and security
- dataset API, Flink
- about / Dataset API
- transformation / Transformation
- data sinks / Data sinks
- datasets
- streaming / What are streaming datasets?
- Data Transfer Protocol / HDFS communication architecture
- Directed Acyclic Graph (DAG) / Spark, Airflow for orchestration, Airflow
- disk skewed data
- managing, in Hadoop 3.x / Managing disk-skewed data in Hadoop 3.x
- distributed mode installation, HBase
- master node configuration / Master node configuration
- slave node configuration / Slave node configuration
- Docker containers
- about / Docker containers in YARN
- configuring / Configuring Docker containers
- Docker image
- running / Running the Docker image
E
- ECClient / Erasure encoding in Hadoop 3.x
- ECManager / Erasure encoding in Hadoop 3.x
- ECWorker / Erasure encoding in Hadoop 3.x
- Elasticsearch / Hadoop logical view
- encryption
- about / Encryption
- data, in transit encryption / Data in transit encryption
- data, at rest encryption / Data at rest encryption
- enterprise-level IDS system
- anomaly-based intrusion detection system / Intrusion detection system
- signature-based intrusion detection system / Intrusion detection system
- erasure coding (EC)
- about / Overview of Hadoop 3 and its features
- advantages / Advantages of erasure coding
- disadvantages / Disadvantages of erasure coding
- erasure coding (EC), Hadoop 3.x / Erasure encoding in Hadoop 3.x
- Extract, Transform, Load (ETL)
- using, for Kafka Connect / Kafka Connect for ETL
- extracting / Kafka Connect for ETL
- transforming / Kafka Connect for ETL
- loading / Kafka Connect for ETL
- about / Data compression, Data ingestion
F
- factor considerations, for file format selection
- query performance / Query performance
- disk usage and compression / Disk usage and compression
- schema change / Schema change
- fair scheduler
- about / Fair scheduler
- queues, scheduling / Scheduling queues
- configuring / Configuring fair scheduler
- features, CAP theorem
- consistency and partition tolerance (CP) / CAP theorem
- availability and partition tolerance (AP) / CAP theorem
- availability and consistency (AC) / CAP theorem
- FIFO scheduler / FIFO scheduler
- file formats
- about / File formats, Understanding file formats
- row format / Row format and column format
- column format / Row format and column format
- schema evolution / Schema evolution
- splittable, versus non-splittable / Splittable versus non-splittable
- data compression / Compression
- text / Text
- sequence file / Sequence file
- Avro / Avro
- Optimized Row Columnar (ORC) / Optimized Row Columnar (ORC)
- Parquet / Parquet
- file formats, Hive
- splittable file format / Splitable and non-splitable file formats
- non-splittable file format / Splitable and non-splitable file formats
- filtering
- about / Filtering
- row-level filtering / Row-level filtering
- column-level filtering / Column-level filtering
- filtering patterns
- about / Filtering patterns
- top-k MapReduce, implementing / Top-k MapReduce implementation
- firewall rules
- inbound rule / Security groups/firewall rules
- outbound rule / Security groups/firewall rules
- Flink
- about / Apache Flink
- architecture / Flink architecture
- ecosystem / Apache Flink ecosystem component
- ecosystem, scenarios / Apache Flink ecosystem component
- ecosystem, components / Apache Flink ecosystem component
- dataset / Dataset and data stream API
- data stream API / Dataset and data stream API
- dataset API / Dataset API
- data streams / Data streams
- table API, exploring / Exploring the table API
- best practices / Best practices
- Flink ecosystem, components
- storage layer / Apache Flink ecosystem component
- deployment mode / Apache Flink ecosystem component
- runtime / Apache Flink ecosystem component
- DataSet and DataStream API / Apache Flink ecosystem component
- Apache Flink tools / Apache Flink ecosystem component
- Flume / Flume
- Flume event-based data ingestion / Flume event-based data ingestion
- Flume interceptor
- about / Flume interceptor
- timestamp interceptor / Timestamp interceptor
- Universally Unique Identifier (UUID) interceptor / Universally Unique Identifier (UUID) interceptor
- Regex filter interceptor / Regex filter interceptor
- custom interceptor, writing / Writing a custom interceptor
- functions, Presto
- about / Functions
- conversion functions / Conversion functions
- mathematical functions / Mathematical functions
- string functions / String functions
G
- general monitoring
- HDFS metrics / HDFS metrics
- YARN metrics / YARN metrics
- ZooKeeper metrics / ZooKeeper metrics
- Apache Ambari / Apache Ambari
- Gridmix
- used, for benchmarking mix-workloads / Gridmix
- Grunt / Introducing Pig Latin and Grunt
- Gzip
H
- Hadoop
- origins / Hadoop origins and Timelines
- timelines / Hadoop origins and Timelines
- data compression / Types of data compression in Hadoop
- logical view / Logical view of Hadoop in the cloud
- security pillars / Hadoop security pillars
- Hadoop 3
- overview / Overview of Hadoop 3 and its features
- features / Overview of Hadoop 3 and its features
- driving factors / Overview of Hadoop 3 and its features
- Hadoop 3.0
- security features / List of security features that have been worked upon in Hadoop 3.0
- Hadoop 3.x
- HDFS high availability / HDFS high availability in Hadoop 3.x
- disk skewed data, managing / Managing disk-skewed data in Hadoop 3.x
- erasure coding (EC) / Erasure encoding in Hadoop 3.x
- YARN Timeline server / YARN Timeline server in Hadoop 3.x
- opportunistic containers / Opportunistic containers in Hadoop 3.x
- Hadoop cluster
- benchmarking / Introduction to benchmarking and profiling
- profiling / Introduction to benchmarking and profiling
- Hadoop Distributed File System (HDFS)
- about / MapReduce origin, Deep dive into the HDFS architecture, HBase, Hadoop and R, HDFS
- defining / Defining HDFS
- features / Defining HDFS
- lazy persist writes / Lazy persist writes in HDFS
- interfaces / HDFS common interfaces
- Hadoop distributions
- about / Hadoop distributions
- benefits / Hadoop distributions
- on-premise distribution / On-premise distribution
- Cloudera / On-premise distribution
- Hortonworks / On-premise distribution
- MapR / On-premise distribution
- cloud distributions / Cloud distributions
- Hadoop ecosystem
- best practices / Best practices
- securing / System security
- Hadoop framework
- MapReduce workflow / MapReduce workflow in the Hadoop framework
- Hadoop logical view / Hadoop logical view
- Hadoop MapReduce framework / Deep dive into the Hadoop MapReduce framework
- Hadoop networks
- securing / Securing Hadoop networks
- Hadoop services' network perimeter
- tools, for securing / Tools for securing Hadoop services' network perimeter
- Hadoop Streaming / Hadoop and R
- HBase
- about / HBase, HBase architecture and its concept
- architecture / HBase architecture and its concept
- operations / HBase operations and its examples
- examples / HBase operations and its examples
- local mode installation / Local mode Installation
- distributed mode installation / Distributed mode installation
- best practices / Best practices
- HBase installation / Installation
- HCatalog / Introduction to HCatalog
- HDFS architecture / Deep dive into the HDFS architecture
- HDFS benchmarking
- with DFSIO / DFSIO
- HDFS command reference
- about / HDFS command reference
- file system commands / File System commands
- distributed copy / Distributed copy
- admin commands / Admin commands
- HDFS communication architecture
- about / HDFS communication architecture
- Client Protocol / HDFS communication architecture
- Data Transfer Protocol / HDFS communication architecture
- Data Node Protocol / HDFS communication architecture
- HDFS delete / HDFS delete
- hdfs diskbalancer / Overview of Hadoop 3 and its features
- HDFS diskbalancer tool / Managing disk-skewed data in Hadoop 3.x
- HDFSFileSystemWrite.java / HDFSFileSystemWrite.java
- HDFS high availability
- in Hadoop 3.x / HDFS high availability in Hadoop 3.x
- HDFS I/O / Introduction to benchmarking and profiling
- HDFS logical architecture / HDFS logical architecture
- HDFS metrics, general monitoring
- about / HDFS metrics
- NameNodes / NameNode metrics
- DataNodes / DataNode metrics
- HDFS reads
- about / Data integrity, HDFS reads and writes, HDFS read
- workflows / Read workflows
- short circuit reads / Short circuit reads
- HDFS Snapshots
- use cases / HDFS Snapshots
- HDFS split brain scenario / HDFS logical architecture
- HDFS writes
- about / Data integrity, HDFS reads and writes, HDFS write
- workflows / Write workflows
- Heron
- about / Storm/Heron
- architecture / Deep dive into the Storm/Heron architecture
- bottlenecks / Introduction to Apache Heron
- components / Heron architecture
- high availability (HA) / HDFS logical architecture, Regions and availability zone, High availability (HA)
- high availability (HA), scenarios
- server failure / Server failure
- Hive
- about / Hive, Introduction to benchmarking and profiling
- Hive client / Apache Hive architecture
- driver / Apache Hive architecture
- Metastore Server / Apache Hive architecture
- executing / Installing and running Hive
- installing / Installing and running Hive
- reference / Installing and running Hive
- queries / Hive queries
- table creation / Hive table creation
- data, loading to table / Loading data to a table
- select query / The select query
- file format, selecting / Choosing file format
- HCatalog / Introduction to HCatalog
- ACID properties / Understanding ACID in HIVE
- ACID properties, example / Example
- Pig, using / Pig with Hive
- benchmarking / Hive
- benchmarking, with TPC-DS / TPC-DS
- benchmarking, with TPC-H / TPC-H
- Hive Driver, components
- Parser / Apache Hive architecture
- Planner / Apache Hive architecture
- optimizer / Apache Hive architecture
- Executor / Apache Hive architecture
- Hive query language (HQL) / Apache Hive architecture, Hive queries
- HiveServer2 / Introduction to HiveServer2
- Hive UDF
- Hortonworks / On-premise distribution
- Hortonworks Data Flow (HDF) / On-premise distribution
- Hortonworks Data Platform (HDP) / On-premise distribution
- host intrusion detection systems (HIDS) / Intrusion detection system
- hot data / Erasure encoding in Hadoop 3.x
- Hue web interface / Understanding the Impala interface and queries
I
- Impala
- about / Impala
- architecture / Impala architecture
- interface / Understanding the Impala interface and queries
- queries / Understanding the Impala interface and queries
- implementing / Practicing Impala
- data, loading from CSV files / Practicing Impala, Loading Data from CSV files
- Impala architecture
- Impala daemon (Impalad) / Impala architecture
- statestore daemon (statestored) / Impala architecture
- catalog daemon (Catalogd) / Impala architecture
- ingestion layer
- about / Logical view of Hadoop in the cloud
- AWS Snowball / Logical view of Hadoop in the cloud
- Cloud Pub/Sub / Logical view of Hadoop in the cloud
- Amazon Kinesis Firehose / Logical view of Hadoop in the cloud
- Amazon Kinesis / Logical view of Hadoop in the cloud
- intrusion detection system (IDS)
- about / Intrusion detection system
- active / Intrusion detection system
- passive / Intrusion detection system
- network intrusion detection system (NIDS) / Intrusion detection system
- host intrusion detection systems (HIDS) / Intrusion detection system
- intrusion prevention system (IPS) / Intrusion prevention system
J
- Java Native Interface (JNI) / Overview of Hadoop 3 and its features
- Java virtual machine (JVM) / Presto installation and basic query execution
- Java virtual machine (JVM) based components, Spark
- Driver / Spark machine learning
- Spark executor / Spark machine learning
- Cluster Manager / Spark machine learning
- JDBC/ODBC interface / Understanding the Impala interface and queries
- join pattern
- about / Join pattern
- reduce side join / Reduce side join
- map side join / Map side join (replicated join)
- JournalNode / HDFS logical architecture
K
- Kafka
- Kafka Connect
- for ETL / Kafka Connect for ETL
- Kafka connector
- about / Kafka connector
- configuration properties / Configuration properties
- Kafka consumers
- about / Internals of producer and consumer, Consumer
- topic, subscribing / Consumer
- consumer offset position / Consumer
- replay/rewind/skip messages / Consumer
- heartbeats / Consumer
- offset commits / Consumer
- deserialization / Consumer
- writing / Writing producer and consumer application
- Kafka producer
- about / Internals of producer and consumer, Producer
- Kafka broker URLs, bootstrapping / Producer
- data serialization / Producer
- topic partition, determining / Producer
- leader of the partition, determining / Producer
- failure handling/retry ability / Producer
- batching / Producer
- application, writing / Writing producer and consumer application
- Kerberos
- advantages / Kerberos advantages
- Kerberos authentication / Kerberos authentication
- Kerberos authentication flows
- about / Kerberos authentication flows
- service authentication / Service authentication
- user authentication / User authentication
- communication, between authenticated client and authenticated Hadoop service / Communication between the authenticated client and the authenticated Hadoop service
- symmetric key-based communication, in Hadoop / Symmetric key-based communication in Hadoop
- Knox gateway
- functionalities / Tools for securing Hadoop services' network perimeter
L
- latency / Latency
- lazy persist writes, HDFS / Lazy persist writes in HDFS
- Lempel-Ziv-Oberhumer (LZO) / Lempel-Ziv-Oberhumer
- logical architecture, Hadoop
- ingestion layer / Logical view of Hadoop in the cloud
- processing / Logical view of Hadoop in the cloud
- storage and analytics / Logical view of Hadoop in the cloud
- machine learning / Logical view of Hadoop in the cloud
- lookups, batch processing
- in memory lookup / Real-time lookup
- REST API / Real-time lookup
- Redis / Real-time lookup
- database lookup / Real-time lookup
M
- machine learning
- steps / Machine learning steps
- challenges / Common machine learning challenges
- machine learning case study
- in Spark / Machine learning case study in Spark
- macro batch ingestion / Macro batch ingestion
- Mahout / Mahout
- management group
- about / HDFS logical architecture
- components / HDFS logical architecture
- MapR / On-premise distribution
- MapReduce
- executing, over YARN / YARN and MapReduce
- use case / MapReduce use case
- about / Introduction to benchmarking and profiling
- MapReduce optimization
- about / Optimizing MapReduce
- hardware configuration / Hardware configuration
- operating system, tuning / Operating system tuning
- techniques / Optimization techniques
- run time configuration / Runtime configuration
- file system optimization / File System optimization
- MapReduce origin / MapReduce origin
- MapReduce patterns
- about / Common MapReduce patterns
- summarization patterns / Summarization patterns
- filtering patterns / Filtering patterns
- join pattern / Join pattern
- composite join / Composite join
- MapReduce workflow
- in Hadoop framework / MapReduce workflow in the Hadoop framework
- map side join / Map side join (replicated join)
- masking / Masking
- massively parallel processing (MPP) / Impala architecture
- message delivery semantics / Message delivery semantics
- metadata
- features / Metadata management
- metadata management / Metadata management, Metadata management
- metrics, DataNode
- remaining / DataNode metrics
- NumFailedVolumes / DataNode metrics
- metrics, NameNode
- CapacityRemaining / NameNode metrics
- UnderReplicatedBlocks / NameNode metrics
- MissingBlocks / NameNode metrics
- VolumeFailuresTotal / NameNode metrics
- NumDeadDataNodes / NameNode metrics
- JMX metrics / NameNode metrics
- micro-batch processing
- case study / Micro-batch processing case study
- micro batch processing
- about / Micro batch processing
- best practices / Micro batch processing
- challenges / Micro batch processing
- Microsoft Azure / Cloud distributions
- mini Reducer / Deep dive into the Hadoop MapReduce framework
- mix-workloads
- benchmarking / Mix-workloads
- benchmarking, with Rumen / Rumen
- benchmarking, with Gridmix / Gridmix
- MovieRatingDriver / MovieRatingDriver
- MovieRatingMapper / MovieRatingMapper
- MovieRatingReducer / MovieRatingReducer
- multinomial Naive Bayes / Machine learning case study in Spark
N
- Naive Bayes (NB) / Machine learning case study in Spark
- NameNode performance
- profiling / NameNode
- benchmarking, with NNBench / NNBench
- benchmarking, with NNThroughputBenchmark / NNThroughputBenchmark
- benchmarking, with synthetic load generator (SLG) / Synthetic load generator (SLG)
- NameNodes
- about / HDFS logical architecture, Introduction to benchmarking and profiling, NameNode metrics
- internals / NameNode internals
- functions / NameNode internals
- INodes / NameNode internals
- data locality / Data locality and rack awareness
- rack awareness / Data locality and rack awareness
- metrics / NameNode metrics
- network / Network
- network, concepts
- regions / Regions and availability zone
- availability zone / Regions and availability zone
- Virtual Private Cloud (VPC) / VPC and subnet
- subnets / VPC and subnet
- security groups / Security groups/firewall rules
- firewall rules / Security groups/firewall rules
- network firewalls / Network firewalls
- network intrusion detection system (NIDS) / Intrusion detection system
- network types
- segregating / Segregating different types of networks
- Nimbus / Deep dive into the Storm/Heron architecture
- NNBench
- used, for benchmarking NameNode performance / NNBench
- NNThroughputBenchmark
- used, for benchmarking NameNode performance / NNThroughputBenchmark
- node labels
- about / Node labels
- configuring / Configuring node labels
- Node manager core / Node manager core
- Nutch Distributed File System (NDFS) / Origins
O
- on-premise distribution / On-premise distribution
- operating system
- tasks / Operating system tuning
- Operating System (OS) security / System security
- operations, HBase
- put operation / Put operation
- get operation / Get operation
- delete operation / Delete operation
- batch operation / Batch operation
- opportunistic containers
- about / Opportunistic containers in Hadoop 3.x
- configuring / Configuring opportunist container
- Optimized Row Columnar (ORC) / Optimized Row Columnar (ORC)
- orchestration / Airflow for orchestration
- origins
- about / Origins
- MapReduce origin / MapReduce origin
- OS vulnerabilities
- reference / System security
- out-of-order events / Out-of-order events
P
- parallel stream processing / Parallel processing
- Parquet / Parquet
- partitioning
- about / Partitioning and bucketing, Partitioning
- prerequisite / Prerequisite
- passive IDS / Intrusion detection system
- Pig
- about / Pig, Introduction to benchmarking and profiling
- installing / Installing and running Pig
- running / Installing and running Pig
- custom UDF, using / How to use custom UDF in Pig
- using, with Hive / Pig with Hive
- best practices / Best practices
- Pig Latin
- about / Introducing Pig Latin and Grunt
- data type / Introducing Pig Latin and Grunt
- statement / Introducing Pig Latin and Grunt
- Presto
- about / Presto – introduction
- architecture / Presto architecture
- coordinator / Presto architecture
- Worker / Presto architecture
- Connectors / Presto architecture
- installation / Presto installation and basic query execution
- basic query execution / Presto installation and basic query execution
- functions / Functions
- Presto connectors
- about / Presto connectors
- Hive connector / Hive connector
- Kafka connector / Kafka connector
- MySQL connector / MySQL connector
- Redshift connector / Redshift connector
- MongoDB connector / MongoDB connector
- principles, for cluster manager selection
- high availability (HA) / Spark cluster managers
- security / Spark cluster managers
- monitoring / Spark cluster managers
- scheduling capability / Spark cluster managers
- private subnet / VPC and subnet
- processing flow, Hadoop MapReduce framework
- about / Deep dive into the Hadoop MapReduce framework
- InputFileFormat / Deep dive into the Hadoop MapReduce framework
- RecordReader / Deep dive into the Hadoop MapReduce framework
- input split / Deep dive into the Hadoop MapReduce framework
- mapper / Deep dive into the Hadoop MapReduce framework
- partitioner / Deep dive into the Hadoop MapReduce framework
- shuffling / Deep dive into the Hadoop MapReduce framework
- sorting / Deep dive into the Hadoop MapReduce framework
- reducer / Deep dive into the Hadoop MapReduce framework
- combiner / Deep dive into the Hadoop MapReduce framework
- output format / Deep dive into the Hadoop MapReduce framework
- properties, balancer
- threshold / Data rebalancing
- policy / Data rebalancing
- public subnet / VPC and subnet
Q
- Quorum Journal Manager (QJM)
- about / Quorum Journal Manager (QJM)
- operations / Quorum Journal Manager (QJM)
R
- R / Hadoop and R
- R and Hadoop integrated programming environment (RHIPE) package / Hadoop and R
- Ranger tool
- architecture / Ranger
- real-time ingestion
- about / Real-time ingestion
- features / Real-time ingestion
- real-time processing
- features / Real-time processing
- case study / Real-time processing case study
- main code / Main code
- code, executing / Executing the code
- reduce side join / Reduce side join
- Regex filter interceptor / Regex filter interceptor
- regions / Regions and availability zone
- region server, HBase
- Block Cache / HBase architecture and its concept
- MemStore / HBase architecture and its concept
- HFile / HBase architecture and its concept
- Write ahead log (WAL) / HBase architecture and its concept
- Remote Procedure Calls (RPC) / Blocks, HDFS communication architecture, Serialization, NNThroughputBenchmark
- replication / Replication
- Resilient Distributed Dataset (RDD)
- exploring / Deep dive into resilient distributed datasets
- features / RDD features
- operations / RDD operations
- transformations / RDD operations
- set operations / RDD operations
- about / Spark machine learning
- Resource Manager / Resource Manager component
- Resource Manager high availability
- about / Resource Manager high availability
- architecture / Architecture of RM high availability
- configuring / Configuring Resource Manager high availability
- resources
- managing / Managing resources
- REST APIs, YARN
- about / YARN REST APIs
- Resource Manager API / Resource Manager API
- Node Manager REST API / Node Manager REST API
- rest encryption / Data at rest encryption
- row-level filtering / Row-level filtering
- R Programming Language with Hadoop (RHadoop) / Hadoop and R
- Rumen
- used, for benchmarking mix-workloads / Rumen
S
- sample data pipeline DAG
- example / Airflow components
- Scheduler Load Simulator (SLS)
- used, for benchmarking YARN cluster / Scheduler Load Simulator (SLS)
- schedulers, YARN
- about / Introduction to YARN job scheduling
- FIFO scheduler / FIFO scheduler
- capacity scheduler / Capacity scheduler
- fair scheduler / Fair scheduler
- Secondary NameNode
- checkpoint / Checkpoint using a secondary NameNode
- sections, Hadoop logical view
- ingress/egress/processing / Hadoop logical view
- data integration components / Hadoop logical view
- data access interfaces / Hadoop logical view
- data processing engines / Hadoop logical view
- resource management frameworks / Hadoop logical view
- task and resource management / Hadoop logical view
- data input/output / Hadoop logical view
- data storage medium / Hadoop logical view
- Secure Shell (SSH) / System security
- Security Administration / Hadoop security pillars
- security information and event management (SIEM)
- working / How does SIEM work?
- collection layer / How does SIEM work?
- storage layer / How does SIEM work?
- correlation and security analytics / How does SIEM work?
- action and compliance layer / How does SIEM work?
- security monitoring
- about / Security monitoring
- intrusion detection system (IDS) / Intrusion detection system
- intrusion prevention system (IPS) / Intrusion prevention system
- security pillars, Hadoop / Hadoop security pillars
- segmentation / Segregating different types of networks
- sentiment analysis
- with Spark ML / Sentiment analysis using Spark ML
- Sentry / Sentry
- sequence file / Sequence file
- serialization
- about / Serialization
- in inter-process communication / Serialization
- in persistent storage / Serialization
- server failure
- server instance high availability / Server instance high availability
- region failure / Region and zone failure
- zone failure / Region and zone failure
- shuffling / Deep dive into the Hadoop MapReduce framework
- signature-based intrusion detection system / Intrusion detection system
- simple authentication security layer (SASL) / Data in transit encryption
- Slowly Changing Dimension (SCD)
- type 1 / Slowly changing dimensions – type 1
- type 2 / Slowly changing dimensions – type 2
- Snappy / Snappy
- social media
- use case / Use case – Twitter data
- sources, Apache Flume
- pollable source / Pollable source
- event-driven source / Event-driven source
- channels / Channels
- sinks / Sinks
- Spark
- about / Spark
- internals / Apache Spark internals
- best practices / Best practices
- architecture / Spark machine learning
- machine learning case study / Machine learning case study in Spark
- SparkContext / Spark machine learning
- Spark internal components
- Spark driver / Spark driver
- Spark workers / Spark workers
- cluster manager / Cluster manager
- application job flow / Spark application job flow
- Spark job
- executing / Installing and running our first Spark job
- installing / Installing and running our first Spark job
- Spark-shell / Spark-shell
- submit command / Spark submit command
- Maven dependencies / Maven dependencies
- accumulators / Accumulators and broadcast variables
- broadcast variables / Accumulators and broadcast variables
- dataframe / Understanding dataframe and dataset
- dataset / Understanding dataframe and dataset
- dataframes, features / Dataframes
- dataset, features / Dataset
- cluster managers / Spark cluster managers
- Spark machine learning
- about / Spark machine learning
- transformer function / Transformer function
- estimator / Estimator
- pipeline / Spark ML pipeline
- sentiment analysis / Sentiment analysis using Spark ML
- spilling / MapReduce workflow in the Hadoop framework
- Storm
- about / Storm/Heron
- architecture / Deep dive into the Storm/Heron architecture
- integrations / Storm integrations
- best practices / Best practices
- Storm application, components
- spout / Concept of a Storm application
- bolt / Concept of a Storm application
- topology / Concept of a Storm application
- Storm integrations
- Kafka integration / Storm integrations
- HBase integration / Storm integrations
- HDFS integration / Storm integrations
- Storm Trident / Understanding Storm Trident
- stream data ingestion
- about / Stream data ingestion
- Flume event-based data ingestion / Flume event-based data ingestion
- with Kafka / Kafka
- stream data processing patterns
- about / Common stream data processing patterns
- unbounded data batch processing / Unbounded data batch processing
- streaming design considerations
- about / Streaming design considerations
- latency / Latency
- data availability / Data availability, integrity, and security
- data integrity / Data availability, integrity, and security
- data security / Data availability, integrity, and security
- unbounded data sources / Unbounded data sources
- data lookups / Data lookups
- data formats / Data formats
- data, serializing / Serializing your data
- parallel stream processing / Parallel processing
- out-of-order events / Out-of-order events
- message delivery semantics / Message delivery semantics
- stream processing
- source / Data streams
- event processing / Data streams
- sink / Data streams
- Structured Query Language (SQL) / Hive queries
- subnets
- private subnet / VPC and subnet
- public subnet / VPC and subnet
- summarization patterns
- about / Summarization patterns
- Word Count / Word count example
- min and max calculation / Minimum and maximum
- Supervisors / Deep dive into the Storm/Heron architecture
- synthetic load generator (SLG)
- used, for benchmarking NameNode performance / Synthetic load generator (SLG)
- system security / System security
T
- text / Text
- throughput / Latency
- timelines / Timelines
- timestamp interceptor / Timestamp interceptor
- top-k MapReduce
- implementing / Top-k MapReduce implementation
- Topology Master (TM) / Heron architecture
- TPC-DS
- used, for benchmarking Hive / TPC-DS
- TPC-H
- used, for benchmarking Hive / TPC-H
- TPC Benchmark (TPC-H) / TPC-H
- transit encryption / Data in transit encryption
- transparent data encryption (TDE) / Data at rest encryption
- types, node labels
- exclusive node label / Node labels
- non-exclusive node labels / Node labels
U
- unbounded data batch processing / Unbounded data batch processing
- unbounded data sources / Unbounded data sources
- Universally Unique Identifier (UUID) interceptor / Universally Unique Identifier (UUID) interceptor
- use case, MapReduce
- about / MapReduce use case
- MovieRatingMapper / MovieRatingMapper
- MovieRatingReducer / MovieRatingReducer
- MovieRatingDriver / MovieRatingDriver
- use cases, HDFS Snapshots
- backup / HDFS Snapshots
- protection / HDFS Snapshots
- application testing / HDFS Snapshots
- Distributed Copy (distcp) / HDFS Snapshots
- Legal and Auditing / HDFS Snapshots
- user authorization
- about / User authorization
- Ranger tool / Ranger
- Sentry tool / Sentry
- user command, YARN command reference
- application commands / Application commands
- logs command / Logs command
- user defined functions (UDF)
- about / Hive UDF
- writing, in Pig / Writing UDF in Pig
- user defined functions (UDF), Pig
- eval function / Eval function
- filter function / Filter function
V
- virtual local area networks (VLANs) / Segregating different types of networks
- Virtual Private Cloud (VPC) / VPC and subnet
W
- warm data / Erasure encoding in Hadoop 3.x
- Word Count, summarization patterns
- about / Word count example
- mapper / Mapper
- reducer / Reducer
- combiner / Combiner
- write-ahead logging (WAL) / Kafka
- write once read many principle / Defining HDFS
Y
- YARN
- Docker containers / Docker containers in YARN
- about / Deep dive into the Hadoop MapReduce framework
- MapReduce, executing / YARN and MapReduce
- YARN cluster
- benchmarking / YARN
- benchmarking, with Scheduler Load Simulator (SLS) / Scheduler Load Simulator (SLS)
- YARN command reference
- about / YARN command reference
- user command / User command
- administration commands / Administration commands
- YARN container
- running, as Docker container / Running the container
- YARN job scheduling / Introduction to YARN job scheduling
- YARN metrics
- unhealthyNodes / YARN metrics
- lostNodes / YARN metrics
- allocatedMB/totalMB / YARN metrics
- containersFailed / YARN metrics
- YARN scheduler / Introduction to benchmarking and profiling
- YARN Timeline server
- about / YARN Timeline server in Hadoop 3.x
- configuring / Configuring YARN Timeline server
- Yet Another Resource Negotiator (YARN)
- about / Architecture
- architecture / Architecture
Z
- ZKFailoverController (ZKFC) / HDFS logical architecture
- ZooKeeper / Apache Kafka architecture
- ZooKeeper failover controllers / HDFS logical architecture
- ZooKeeper metrics
- zk_num_alive_connections / ZooKeeper metrics
- zk_followers / ZooKeeper metrics
- zk_avg_latency / ZooKeeper metrics
- ZooKeeper Quorum / HDFS logical architecture