Index
A
- abstract syntax tree (AST) / The Hive compiler
- access control entry (ACE)
- Access Control Lists (ACLs)
- about / Hadoop-0.20-security, The security pillars
- Accumulator interface
- about / The Accumulator interface
- Accumulator UDFs
- usage / The usage of Accumulator UDFs
- activate command
- about / Installation procedure
- Active Directory
- about / Group listings for an HDFS user
- administration commands
- about / YARN commands, Administration commands
- advanced aggregation support / Advanced aggregation support
- advanced FOREACH operator
- about / The advanced FOREACH operator
- FLATTEN / The FLATTEN operator
- nested FOREACH / The nested FOREACH operator
- COGROUP / The COGROUP operator
- UNION / The UNION operator
- CROSS / The CROSS operator
- advanced Pig operators
- about / The advanced Pig operators
- advanced FOREACH operator / The advanced FOREACH operator
- specialized joins / Specialized joins in Pig
- aggregate functions
- about / The aggregate functions
- Algebraic interface / The Algebraic interface
- Accumulator interface / The Accumulator interface
- algebraic function
- about / The Algebraic interface
- Algebraic interface
- about / The Algebraic interface
- Algebraic UDFs
- usage / The usage of Algebraic UDFs
- all grouping
- about / Developing with Apache Storm
- allocate method / Writing the Application Master entity
- ALTER INDEX command / Indexes on Hive tables
- Amazon
- URL, for creating account / Amazon Elastic MapReduce (EMR)
- AMRMClient class
- Apache Mahout
- about / Apache Mahout
- use cases / Apache Mahout
- used, for document analysis / Document analysis using Hadoop and Mahout
- used, for K-means clustering / K-means clustering using Apache Mahout
- URL / K-means clustering using Apache Mahout
- Apache Software Foundation (ASF)
- about / The evolution of Hadoop
- Apache Storm
- about / Apache Storm
- features / Apache Storm
- architecture / Architecture of an Apache Storm cluster
- Supervisor daemon / Architecture of an Apache Storm cluster
- high-level view / Architecture of an Apache Storm cluster
- computation / Computation and data modeling in Apache Storm
- data modeling / Computation and data modeling in Apache Storm
- abstractions / Computation and data modeling in Apache Storm
- spouts / Computation and data modeling in Apache Storm
- bolts / Computation and data modeling in Apache Storm
- topologies / Computation and data modeling in Apache Storm
- use cases / Use cases for Apache Storm
- developing / Developing with Apache Storm
- reference link, for releases / Apache Storm 0.9.1
- enhancements / Apache Storm 0.9.1
- Apache Storm 0.9.1
- about / Apache Storm 0.9.1
- Application-Client Protocol / Resource Manager (RM)
- Application-Master Protocol / Resource Manager (RM)
- application command / User commands
- Application Master (AM) / The YARN architecture
- about / Application Master (AM)
- ApplicationMaster (AM)
- about / Architecture overview
- ApplicationMaster entity
- writing / Writing the Application Master entity
- ResourceManager / Writing the Application Master entity
- NodeManager / Writing the Application Master entity
- ApplicationMasterProtocol
- ApplicationReport object / Writing YARN clients
- ApplicationsManager / Resource Manager (RM)
- architecture, Apache Storm
- about / Architecture of an Apache Storm cluster
- topologies / Architecture of an Apache Storm cluster
- Master node / Architecture of an Apache Storm cluster
- Worker node / Architecture of an Apache Storm cluster
- architecture, HDFS
- limitations / Limitations of the older HDFS architecture
- architecture, HDFS Federation
- about / Architecture of HDFS Federation
- block pools / Architecture of HDFS Federation
- Namespace Volume / Architecture of HDFS Federation
- ClusterId / Architecture of HDFS Federation
- architecture, Kerberos
- ArrayFile format / Other data structures
- ARRAYS, complex types / Data types
- auditing
- about / The security pillars
- audit logging
- about / Audit logging in Hadoop
- authentication
- about / The security pillars
- in Kerberos / Authentication in Hadoop
- via HTTP interfaces / Authentication via HTTP interfaces
- Authentication Server (AS)
- authorization
- about / The security pillars, Authorization in Hadoop
- in HDFS / Authorization in HDFS
- HDFS usage, limiting / Limiting HDFS usage
- service-level authorization / Service-level authorization in Hadoop
- automatic failover
- Avro
- about / Avro serialization
- features / Avro serialization
- and MapReduce / Avro and MapReduce
- and Pig / Avro and Pig
- and Hive / Avro and Hive
- versus Protocol Buffers / Comparison – Avro versus Protocol Buffers / Thrift
- versus Thrift / Comparison – Avro versus Protocol Buffers / Thrift
- AvroSerde module / Avro and Hive
- Avro serialization
- about / Avro serialization
B
- Backup Node
- Bag data type, Pig / Complex data types in Pig
- batch-processing systems
- disadvantages, overcoming / Batch processing versus streaming
- batch mode / Different modes of execution
- batch processing
- versus streaming / Batch processing versus streaming
- diagrammatic representation / Batch processing versus streaming
- best practices, Pig
- about / Best practices
- explicit usage, of types / The explicit usage of types
- early and frequent projection / Early and frequent projection
- early and frequent filtering / Early and frequent filtering
- usage, of LIMIT operator / The usage of the LIMIT operator
- usage, of DISTINCT operator / The usage of the DISTINCT operator
- reduction of operations / The reduction of operations
- usage, of Algebraic UDFs / The usage of Algebraic UDFs
- usage, of Accumulator UDFs / The usage of Accumulator UDFs
- nulls, eliminating in data / Eliminating nulls in the data
- usage, of specialized joins / The usage of specialized joins
- intermediate results, compressing / Compressing intermediate results
- smaller files, combining / Combining smaller files
- bitmap
- about / Indexes on Hive tables
- Bitmap indexes / Indexes on Hive tables
- block placement, HDFS
- about / HDFS block placement
- pluggable block placement policy / Pluggable block placement policy
- block pools
- about / Architecture of HDFS Federation
- Block Storage Service, HDFS architecture
- BloomMapFile format / Other data structures
- bolt
- bucketized map-side join / Map-side joins
- bucketized sort-merge join / Map-side joins
- buckets
C
- -config option / YARN commands
- CapacityScheduler
- about / CapacityScheduler
- features / CapacityScheduler
- methods / CapacityScheduler
- CDH
- Checkpoint Node
- classes, Hadoop
- VIntWritable / Writable and WritableComparable
- VLongWritable / Writable and WritableComparable
- classification, Apache Mahout / Apache Mahout
- clauses
- using / Other advanced clauses
- client
- about / The YARN architecture
- cloud computing
- service models / Cloud computing characteristics
- cloud computing, benefits
- lower costs / Cloud computing characteristics
- elasticity / Cloud computing characteristics
- administration / Cloud computing characteristics
- cloud computing, characteristics
- on-demand self service / Cloud computing characteristics
- broad network access / Cloud computing characteristics
- resource pooling / Cloud computing characteristics
- rapid elasticity / Cloud computing characteristics
- measured service / Cloud computing characteristics
- ClusterId
- about / Architecture of HDFS Federation
- clustering
- K-means, using / Clustering using k-means
- clustering, Apache Mahout / Apache Mahout
- clusters
- about / The data model
- COGROUP operator / The COGROUP operator
- collaborative filtering, Apache Mahout / Apache Mahout
- Combiners
- Combiners, Pig / Combiners in Pig
- Command Line Interface (CLI) / The supporting components of Hive
- compact
- about / Indexes on Hive tables
- compiler, Hive / The Hive compiler
- complex data types, Pig
- Map / Complex data types in Pig
- Tuple / Complex data types in Pig
- Bag / Complex data types in Pig
- complex types
- STRUCTS / Data types
- MAPS / Data types
- ARRAYS / Data types
- UNIONS / Data types
- compressed files / Compressed files
- compression
- about / Compression
- DEFLATE compression / Compression
- and splits / Splits and compressions
- enabling, strategies / Splits and compressions
- scope / Scope for compression
- computation, Apache Storm
- constituents, RHadoop
- container
- Container Launch Context (CLC) / Node Manager (NM)
- ContainerManager
- ContainerManager protocol
- about / Application Master (AM)
- Container object / Writing the Application Master entity
- Copy phase
- core-site.xml file, properties
- hadoop.http.filter.initializers / Authentication via HTTP interfaces
- hadoop.http.authentication.type / Authentication via HTTP interfaces
- hadoop.http.authentication.token.validity / Authentication via HTTP interfaces
- hadoop.http.authentication.signature.secret.file / Authentication via HTTP interfaces
- hadoop.http.authentication.cookie.domain / Authentication via HTTP interfaces
- hadoop.http.authentication.simple.anonymous.allowed / Authentication via HTTP interfaces
- hadoop.http.authentication.kerberos.principal / Authentication via HTTP interfaces
- hadoop.http.authentication.kerberos.keytab / Authentication via HTTP interfaces
- cosine similarity distance measures / Cosine similarity distance measures
- counters
- about / MapReduce job counters
- countrycodes.txt file
- URL / Reduce-side joins
- crawling
- about / The inception of Hadoop
- CROSS operator / The CROSS operator
- cubes
- about / Advanced aggregation support
D
- daemonlog command / Administration commands
- data analytics
- about / Data analytics workflow
- workflow / Data analytics workflow
- data analytics workflow
- about / Data analytics workflow
- steps / Data analytics workflow
- database
- about / The data model
- data confidentiality
- about / Data confidentiality in Hadoop
- HTTPS / HTTPS and encrypted shuffle
- encrypted shuffle / HTTPS and encrypted shuffle
- Data Definition Language (DDL)
- about / The data model
- data mining / Machine learning
- data model
- about / The data model
- dynamic partitions / Dynamic partitions
- indexes, on Hive tables / Indexes on Hive tables
- data modeling, Apache Storm
- data protection
- about / The security pillars
- data security
- security pillars / The security pillars
- data serialization, Hadoop
- about / Data serialization in Hadoop
- Writable interface / Writable and WritableComparable
- WritableComparable interface / Writable and WritableComparable
- data types
- about / Data types
- deactivate command
- about / Installation procedure
- declareOutputFields method
- about / Developing with Apache Storm
- default / FairScheduler
- DEFERRED REBUILD directive / Indexes on Hive tables
- DESCRIBE command
- about / The DESCRIBE command
- deserialization
- about / Data serialization in Hadoop
- dev-zookeeper command
- about / Installation procedure
- development and debugging aids, Pig
- DESCRIBE / The DESCRIBE command
- EXPLAIN / The EXPLAIN command
- ILLUSTRATE / The ILLUSTRATE command
- dfs.blocksize attribute
- about / The dfs.blocksize attribute
- Directed Acyclic Graphs (DAGs)
- about / Pig versus SQL
- direct grouping
- about / Developing with Apache Storm
- DISTINCT operator
- distributive function
- about / The Algebraic interface
- DML
- about / Advanced DML
- GROUP BY operation / The GROUP BY operation
- ORDER BY clause, versus SORT BY clause / ORDER BY versus SORT BY clauses
- JOIN operator / The JOIN operator and its types
- advanced aggregation support / Advanced aggregation support
- clauses, using / Other advanced clauses
- document analysis
- Hadoop, using / Document analysis using Hadoop and Mahout
- Mahout, using / Document analysis using Hadoop and Mahout
- term frequency / Term frequency
- document frequency / Document frequency
- Tf-Idf / Tf-Idf in Pig
- cosine similarity distance measures / Cosine similarity distance measures
- clustering, with K-means / Clustering using k-means
- document frequency / Document frequency
- Driver / The supporting components of Hive
- drpc command
- about / Installation procedure
- dynamic counter
- about / MapReduce job counters
- dynamic partitions
- about / Dynamic partitions
- semantics / Semantics for dynamic partitioning
E
- embedded mode / Different modes of execution
- EMR
- comparing, with HDInsight / Hadoop on the cloud
- about / Amazon Elastic MapReduce (EMR)
- workloads, creating / Amazon Elastic MapReduce (EMR)
- workloads, executing / Amazon Elastic MapReduce (EMR)
- URL, for developer guide / Amazon Elastic MapReduce (EMR)
- Hadoop cluster, provisioning on / Provisioning a Hadoop cluster on EMR
- encrypted shuffle
- about / HTTPS and encrypted shuffle
- SSL configuration, modifying / SSL configuration changes
- keystore, configuring / Configuring the keystore and truststore
- truststore, configuring / Configuring the keystore and truststore
- enhancements, Apache Storm
- Netty-based transport / Apache Storm 0.9.1
- Windows support / Apache Storm 0.9.1
- Apache Software Foundation / Apache Storm 0.9.1
- Maven Integration / Apache Storm 0.9.1
- Euclidean distance / Cosine similarity distance measures
- evaluation criteria, Hadoop distributions
- performance / Performance
- scalability / Scalability
- reliability / Reliability
- manageability / Manageability
- evaluation functions
- about / The evaluation functions
- aggregate functions / The aggregate functions
- filter functions / The filter functions
- execute method
- about / Developing with Apache Storm
- execution engine, Hive / The Hive execution engine
- execution modes, Pig
- interactive / Different modes of execution
- batch / Different modes of execution
- embedded / Different modes of execution
- EXPLAIN command
- about / The EXPLAIN command
- EXTERNAL keyword / The data model
- external tables
- about / The data model
- Extract-Transform-Load (ETL)
- about / Pig versus SQL
F
- failover modes, Hadoop
- manual failover / High availability – edits sharing
- automatic failover / High availability – edits sharing
- FairScheduler
- about / FairScheduler
- configuring / FairScheduler
- federated NameNodes
- deploying / Deploying federated NameNodes
- field
- about / Complex data types in Pig
- fields grouping
- about / Developing with Apache Storm
- FileBasedKeyStoreFactory
- file formats
- about / File formats, File formats
- compressed files / Compressed files
- ORC files / ORC files
- Parquet files / The Parquet files
- Sequence / The Sequence file format
- MapFile / The MapFile format
- SetFile / Other data structures
- ArrayFile / Other data structures
- BloomMapFile / Other data structures
- filesystem
- implementing, in Hadoop / Implementing a filesystem in Hadoop
- filter functions / The filter functions
- filtering, MapReduce input
- about / Filtering inputs
- FilterLogicExpressionSimplifier optimization rule
- simplifications, performing / The optimization rules
- First in First Out (FIFO) / Job scheduling in YARN
- FLATTEN operator / The FLATTEN operator
- four-layer network topology
- versus three-layer network topology / Three-layer versus four-layer network topology
- Fragment-Replicate join
- about / The Replicated join
- considerations / The Replicated join
- frequent itemset mining, Apache Mahout / Apache Mahout
- fsck
- about / Useful HDFS tools
G
- getDiagnostics function / Writing YARN clients
- global grouping
- about / Developing with Apache Storm
- Global Rearrange (GR) operator / The physical plan
- Google File System (GFS)
- about / The inception of Hadoop
- Greenplum
- about / Pivotal HD
- GROUP BY operation
- Multi-Group-By Inserts / The GROUP BY operation
- Map-side aggregation for GROUP BY / The GROUP BY operation
H
- Hadoop
- inception / The inception of Hadoop
- evolution / The evolution of Hadoop
- genealogy / Hadoop's genealogy
- -0.20-append / Hadoop-0.20-append
- -0.20-security / Hadoop-0.20-security
- timeline / Hadoop's timeline
- versus Java serialization / Hadoop versus Java serialization
- filesystem, implementing / Implementing a filesystem in Hadoop
- S3 native filesystem (s3n), implementing / Implementing an S3 native filesystem in Hadoop
- used, for document analysis / Document analysis using Hadoop and Mahout
- deploying, on Microsoft Windows / Deploying Hadoop on Microsoft Windows
- building / Building Hadoop
- configuring / Configuring Hadoop
- deploying / Deploying Hadoop
- Hadoop, branches
- 0.20.1 branch / Hadoop's genealogy
- 0.20.2 branch / Hadoop's genealogy
- 0.21 branch / Hadoop's genealogy
- Hadoop-0.20-append
- about / Hadoop-0.20-append
- Hadoop-0.20-security
- about / Hadoop-0.20-security
- hadoop.security.authentication property
- Simple value / Identity of an HDFS user
- kerberos value / Identity of an HDFS user
- Hadoop 1.X
- limitations / Hadoop 2.X
- Hadoop 2.X
- YARN / Yet Another Resource Negotiator (YARN)
- storage layer enhancements / Storage layer enhancements
- other enhancements / Other enhancements
- support enhancements / Support enhancements
- Hadoop archive files (HAR)
- about / Hadoop's "small files" problem
- Hadoop as a Service (HaaS)
- about / Cloud computing characteristics
- Hadoop cluster
- provisioning, on EMR / Provisioning a Hadoop cluster on EMR
- Hadoop deployment, on Microsoft Windows
- about / Deploying Hadoop on Microsoft Windows
- prerequisites / Prerequisites
- Java JDK / Prerequisites
- Path variable, setting / Prerequisites
- JAVA_HOME environment variable, setting / Prerequisites
- Hadoop sources, downloading / Prerequisites
- Protobuf compiler / Prerequisites
- Maven Build System / Prerequisites
- Hadoop, building / Building Hadoop
- Hadoop, configuring / Configuring Hadoop
- Hadoop, deploying / Deploying Hadoop
- Hadoop distributions
- about / Hadoop distributions
- evaluation criteria / Which Hadoop distribution?
- URL / Available distributions
- CDH / Available distributions, Cloudera Distribution of Hadoop (CDH)
- HDP / Available distributions
- MapR / Cloudera Distribution of Hadoop (CDH)
- Pivotal HD / Cloudera Distribution of Hadoop (CDH), Pivotal HD
- Hadoop sources
- downloading / Prerequisites
- Hadoop Streaming
- Hadoop support, S3
- about / Hadoop support for S3
- S3 native filesystem (s3n) / Hadoop support for S3
- S3 block filesystem (s3) / Hadoop support for S3
- HADOOP_CONF_DIR environment variable / Different modes of execution
- HDFS
- advantages / HDFS – advantages and drawbacks
- drawbacks / HDFS – advantages and drawbacks
- high availability / HDFS high availability
- block placement / HDFS block placement
- name quotas / Name quotas in HDFS
- space quotas / Space quotas in HDFS
- HDFS APIs
- using / HDFS APIs and shell commands
- HDFS architecture
- Namespace component / Limitations of the older HDFS architecture
- Block Storage Service component / Limitations of the older HDFS architecture
- limitations / Limitations of the older HDFS architecture
- HDFS authorization
- about / Authorization in HDFS
- HDFS user, identifying / Identity of an HDFS user
- group listings, for HDFS user / Group listings for an HDFS user
- HDFS APIs / HDFS APIs and shell commands
- Shell commands / HDFS APIs and shell commands
- HDFS superuser, specifying / Specifying the HDFS superuser
- turning off / Turning off HDFS authorization
- HDFS Federation
- about / HDFS Federation, Architecture of HDFS Federation
- architecture / Architecture of HDFS Federation
- benefits / Benefits of HDFS Federation
- federated NameNodes, deploying / Deploying federated NameNodes
- HDInsight
- comparing, with EMR / Hadoop on the cloud
- HDP
- about / Hortonworks Data Platform (HDP)
- high availability
- about / High availability
- High Availability (HA)
- high availability, HDFS
- about / HDFS high availability
- Secondary NameNode / Secondary NameNode, Checkpoint Node, and Backup Node
- Checkpoint Node / Secondary NameNode, Checkpoint Node, and Backup Node
- Backup Node / Secondary NameNode, Checkpoint Node, and Backup Node
- edits file, sharing / High availability – edits sharing
- Hive
- and Avro / Avro and Hive
- hive.exec.max.created.files property / Semantics for dynamic partitioning
- hive.exec.max.dynamic.partitions.pernode property / Semantics for dynamic partitioning
- hive.exec.max.dynamic.partitions property / Semantics for dynamic partitioning
- Hive architecture
- about / The Hive architecture
- metastore / The Hive metastore
- compiler / The Hive compiler
- execution engine / The Hive execution engine
- supporting components / The supporting components of Hive
- Hive index
- about / Indexes on Hive tables
- Hive query optimizers
- about / Hive query optimizers
- ColumnPruner / Hive query optimizers
- GlobalLimitOptimizer / Hive query optimizers
- GroupByOptimizer / Hive query optimizers
- JoinReorder / Hive query optimizers
- PredicatePushdown / Hive query optimizers
- PredicateTransitivePropagate / Hive query optimizers
- BucketingSortingReduceSinkOptimizer / Hive query optimizers
- LimitPushdownOptimizer / Hive query optimizers
- NonBlockingOpDeDupProc / Hive query optimizers
- PartitionPruner / Hive query optimizers
- ReduceSinkDeDuplication / Hive query optimizers
- RewriteGBUsingIndex / Hive query optimizers
- StatsOptimizer / Hive query optimizers
- horizontal scaling
- about / Scalability
- HTTP interfaces
- used, for authentication / Authentication via HTTP interfaces
- HTTPS
- about / HTTPS and encrypted shuffle
I
- ILLUSTRATE command
- about / The ILLUSTRATE command
- import checkpoint
- about / Useful HDFS tools
- indexes, on Hive tables
- about / Indexes on Hive tables
- Infrastructure as a Service (IaaS)
- about / Cloud computing characteristics
- init() method / UDF, UDAF, and UDTF
- InputFormat class
- about / The InputFormat class
- functions / The InputFormat class
- InputSplit class
- attributes / The InputSplit class
- installation
- Apache Storm-on-YARN / Installing Apache Storm-on-YARN
- interactive mode / Different modes of execution
- Interface Definition Language (IDL) / Comparison – Avro versus Protocol Buffers / Thrift, Computation and data modeling in Apache Storm
- io.seqfile.compression.type property / Compressed files
- iterator() method / UDF, UDAF, and UDTF
J
- jar command / User commands
- Java Development Kit (JDK) / Prerequisites
- Java JDK / Prerequisites
- Java Runtime Environment (JRE) / Prerequisites
- JavaScript Object Notation (JSON)
- about / Avro serialization
- Java serialization
- versus Hadoop / Hadoop versus Java serialization
- JAVA_HOME environment variable
- setting / Prerequisites
- job scheduling, in YARN
- about / Job scheduling in YARN
- CapacityScheduler / CapacityScheduler
- FairScheduler / FairScheduler
- JOIN operator
- about / The JOIN operator and its types
- Map-side joins / Map-side joins
- joins
- about / Handling data joins
- Reduce-side joins / Handling data joins, Reduce-side joins
- Map-side joins / Handling data joins, Map-side joins
K
- K-means
- used, for clustering / Clustering using k-means
- K-means clustering
- Apache Mahout, using / K-means clustering using Apache Mahout
- Kerberos
- architecture / The Kerberos architecture and workflow
- workflow / The Kerberos architecture and workflow
- Kerberos authentication
- about / Kerberos authentication
- mutual authentication / Kerberos authentication
- single login per session / Kerberos authentication
- protocol message encryption / Kerberos authentication
- and Hadoop / Kerberos authentication and Hadoop
- Key Distribution Center (KDC)
- about / The Kerberos architecture and workflow
- Authentication Server (AS) / The Kerberos architecture and workflow
- Ticket Granting Server (TGS) / The Kerberos architecture and workflow
- keystore
- configuring / Configuring the keystore and truststore
- about / Configuring the keystore and truststore
- keytab file
L
- label / Machine learning
- latency
- about / Performance
- Latent Dirichlet Allocation (LDA) / Apache Mahout
- lemmatization / Tf-Idf in Pig
- Lempel-Ziv-Oberhumer (LZO) / Compressed files
- Lightweight Directory Access Protocol (LDAP)
- about / Group listings for an HDFS user
- LIMIT operator
- usage / The usage of the LIMIT operator
- list command
- about / Installation procedure
- LoadFunc abstract class
- setLocation function / The load functions
- prepareToRead method / The load functions
- getInputFormat method / The load functions
- load functions
- about / The load functions
- localconfvalue command
- about / Installation procedure
- local mode / Different modes of execution
- local or shuffle grouping
- about / Developing with Apache Storm
- Local Rearrange (LR) operator / The physical plan
- logical plan, Pig scripts compilation
- about / The logical plan
- logs command / User commands
- logviewer command
- about / Installation procedure
- LZO compression format / Splits and compressions
M
- machine learning
- about / Machine learning
- process, steps / Machine learning
- machine learning, types
- supervised learning / Machine learning
- unsupervised learning / Machine learning
- semi-supervised learning / Machine learning
- manual failover
- Map-side aggregation for GROUP BY / The GROUP BY operation
- Map-side joins
- about / Map-side joins
- considerations / Map-side joins
- Map data type, Pig / Complex data types in Pig
- MapFile format / The MapFile format
- MapR
- about / MapR
- MapReduce
- and Avro / Avro and MapReduce
- about / Batch processing versus streaming
- MapReduce input
- about / MapReduce input
- InputFormat class / The InputFormat class
- InputSplit class / The InputSplit class
- RecordReader class / The RecordReader class
- Hadoop's small files, dealing / Hadoop's "small files" problem
- filtering / Filtering inputs
- mapreduce mode / Different modes of execution
- MapReduce output
- optimizing / MapReduce output
- speculative execution, of tasks / Speculative execution of tasks
- MapReduce plan, Pig scripts compilation
- about / The MapReduce plan
- MAPS, complex types / Data types
- Map task
- about / The Map task
- dfs.blocksize attribute / The dfs.blocksize attribute
- intermediate outputs, spilling / Sort and spill of intermediate outputs
- intermediate outputs, sorting / Sort and spill of intermediate outputs
- Combiners / Node-local Reducers or Combiners
- intermediate outputs, fetching / Fetching intermediate outputs – Map-side
- Master node, Apache Storm
- about / Architecture of an Apache Storm cluster
- key functions / Architecture of an Apache Storm cluster
- MasterServer
- about / Installation procedure
- Maven Build System
- about / Prerequisites
- merge() function / UDF, UDAF, and UDTF
- Merge-sparse join / The Merge join
- Merge join / The Merge join
- metastore, Hive / The Hive metastore
- Multi-Group-By Inserts / The GROUP BY operation
- multiquery mode, Pig / The multiquery mode in Pig
N
- name quotas
- about / Name quotas in HDFS
- NameServiceId
- about / Deploying federated NameNodes
- Namespace, HDFS architecture
- Namespace Volume
- about / Architecture of HDFS Federation
- National Institute of Standards and Technology (NIST)
- nested FOREACH operator / The nested FOREACH operator
- nextTuple method
- about / Developing with Apache Storm
- nimbus command
- about / Installation procedure
- node command / User commands
- NodeManager / Writing the Application Master entity
- NodeManager (NM)
- about / Architecture overview
- Node Manager (NM) / The YARN architecture
- about / Node Manager (NM)
- none grouping
- about / Developing with Apache Storm
- Nutch
- about / The evolution of Hadoop
O
- Object-relational mapping (ORM) / The Hive metastore
- open method
- about / Developing with Apache Storm
- optimization rules, Pig
- PartitionFilterOptimizer / The optimization rules
- FilterLogicExpressionSimplifier / The optimization rules
- SplitFilter / The optimization rules
- PushUpFilter / The optimization rules
- MergeFilter / The optimization rules
- PushDownForEachFlatten / The optimization rules
- LimitOptimizer / The optimization rules
- AddForEach / The optimization rules
- MergeForEach / The optimization rules
- GroupByConstParallelSetter / The optimization rules
- ORC files / ORC files
- ORDER BY clause
- versus SORT BY clause / ORDER BY versus SORT BY clauses
- outputs, Map task
- sorting / Sort and spill of intermediate outputs
- spilling / Sort and spill of intermediate outputs
- fetching / Fetching intermediate outputs – Map-side
- outputs, Reduce task
- fetching / Fetching intermediate outputs – Reduce-side
- merging / Merge and spill of intermediate outputs
- spilling / Merge and spill of intermediate outputs
- overfitting / Machine learning
P
- Package (P) operator / The physical plan
- PageRank
- about / The inception of Hadoop
- Parquet files / The Parquet files
- partitions
- about / The data model
- Path variable
- setting / Prerequisites
- performance optimizations, Pig
- optimization rules / The optimization rules
- script performance, measuring / Measurement of Pig script performance
- conditions, for invoking Combiners / Combiners in Pig
- memory, for Bag data type / Memory for the Bag data type
- number of reducers / Number of reducers in Pig
- multiquery mode / The multiquery mode in Pig
- physical plan, Pig scripts compilation
- about / The physical plan
- Pig
- versus SQL / Pig versus SQL
- primitive data types / Complex data types in Pig
- complex data types / Complex data types in Pig
- development and debugging aids / Development and debugging aids
- specialized joins / Specialized joins in Pig
- performance optimizations / Pig performance optimizations
- best practices / Best practices
- and Avro / Avro and Pig
- pig
- Tf-Idf, calculating / Tf-Idf in Pig
- piggy bank
- about / User-defined functions
- Pig script performance
- measuring / Measurement of Pig script performance
- Pig scripts compilation
- about / Compiling Pig scripts
- logical plan / The logical plan
- physical plan / The physical plan
- MapReduce plan / The MapReduce plan
- Pivotal HD
- about / Pivotal HD
- Platform as a Service (PaaS)
- about / Cloud computing characteristics
- pluggable block placement policy, HDFS / Pluggable block placement policy
- plyrmr / RHadoop
- Porter Stemmer / Tf-Idf in Pig
- prepare method
- about / Developing with Apache Storm
- primitive data types, Pig / Complex data types in Pig
- Priority class / Writing the Application Master entity
- Protobuf compiler
- about / Prerequisites
- Protocol Buffers / Comparison – Avro versus Protocol Buffers / Thrift
- about / Other enhancements
- versus Avro / Comparison – Avro versus Protocol Buffers / Thrift
Q
- queuePlacementPolicy element, cluster
- rule / FairScheduler
- queues, cluster
- minResources / FairScheduler
- maxResources / FairScheduler
- maxRunningApps / FairScheduler
- weight / FairScheduler
- schedulingPolicy / FairScheduler
- aclSubmitApps / FairScheduler
- aclAdministerApps / FairScheduler
- minSharePreemptionTimeout / FairScheduler
- QuorumPeerMain service
- about / Installation procedure
R
- R
- R5
- about / The RecordReader class
- ravro / RHadoop
- rebalance command
- about / Installation procedure
- rebalancer
- about / Useful HDFS tools
- Record IO / Hadoop versus Java serialization
- RecordReader class
- about / The RecordReader class
- Reduce-side joins
- about / Reduce-side joins
- requisites / Reduce-side joins
- reference link / Reduce-side joins
- Reduce task
- about / The Reduce task
- intermediate outputs, fetching / Fetching intermediate outputs – Reduce-side
- intermediate outputs, merging / Merge and spill of intermediate outputs
- intermediate outputs, spilling / Merge and spill of intermediate outputs
- registerApplicationMaster method / Writing the Application Master entity
- Regular UDFs / UDF, UDAF, and UDTF
- remoteconfvalue command
- about / Installation procedure
- Remote Procedure Calls (RPCs)
- about / Data serialization in Hadoop
- Replicated join / The Replicated join
- replicated keyword / The Replicated join
- resource allocation
- about / Resource Manager (RM)
- ResourceManager / Writing the Application Master entity
- Resource Manager (RM) / The YARN architecture
- about / Resource Manager (RM)
- Scheduler / Resource Manager (RM)
- ApplicationsManager / Resource Manager (RM)
- ResourceManager (RM)
- about / Architecture overview
- Resource object / Writing YARN clients
- RHadoop
- rhbase / RHadoop
- rhdfs / RHadoop
- rmadmin command
- about / Administration commands
- -refreshQueues / Administration commands
- -refreshNodes / Administration commands
- -refreshUserToGroupMappings / Administration commands
- -refreshSuperUserGroupsConfiguration / Administration commands
- -refreshAdminAcls / Administration commands
- -refreshServiceAcl / Administration commands
- rmr / RHadoop
- Robot Exclusion Standard
- about / The evolution of Hadoop
- robots.txt protocol
- about / The evolution of Hadoop
- root / CapacityScheduler
S
- S3
- about / Amazon AWS S3
- Hadoop support / Hadoop support for S3
- S3 block filesystem (s3)
- about / Hadoop support for S3
- S3 native filesystem (s3n)
- about / Hadoop support for S3
- implementing, in Hadoop / Implementing an S3 native filesystem in Hadoop
- Safe Mode
- Scheduler / Resource Manager (RM)
- schemas
- about / Avro serialization
- Secondary NameNode
- security.client.datanode.protocol.acl property
- security.client.protocol.acl property
- security.datanode.protocol.acl property
- security.ha.service.protocol.acl property
- security.inter.datanode.protocol.acl property
- security.inter.tracker.protocol.acl property
- security.job.submission.protocol.acl property
- security.namenode.protocol.acl property
- security.refresh.policy.protocol.acl property
- security.task.umbilical.protocol.acl property
- security pillars, data security
- about / The security pillars
- authentication / The security pillars
- authorization / The security pillars
- auditing / The security pillars
- data protection / The security pillars
- semi-supervised learning
- about / Machine learning
- seq2sparse command / K-means clustering using Apache Mahout
- seqdumper command / K-means clustering using Apache Mahout
- Sequence files
- about / The Sequence file format
- reading / Reading and writing Sequence files
- writing / Reading and writing Sequence files
- SerDe / The supporting components of Hive, The Parquet files
- serialization
- about / Data serialization in Hadoop
- service-level authorization
- service models, cloud computing
- Infrastructure as a Service (IaaS) / Cloud computing characteristics
- Platform as a Service (PaaS) / Cloud computing characteristics
- Software as a Service (SaaS) / Cloud computing characteristics
- SetFile format / Other data structures
- setMemory method / Writing the Application Master entity
- Shell commands
- using / HDFS APIs and shell commands
- shuffle grouping
- about / Developing with Apache Storm
- Single Point of Failure (SPOF)
- about / Reliability
- Skewed joins
- about / Skewed joins
- considerations / Skewed joins
- skewed keyword / Skewed joins
- small files, Hadoop
- dealing with / Hadoop's "small files" problem
- snapshots, HDFS
- about / HDFS snapshots
- Software as a Service (SaaS)
- about / Cloud computing characteristics
- Sort Avoidance
- SORT BY clause
- versus ORDER BY clause / ORDER BY versus SORT BY clauses
- sort join / The Merge join
- space quotas
- about / Space quotas in HDFS
- Spark
- about / Developing YARN applications
- specialized joins, Pig
- Replicated join / The Replicated join
- Skewed join / Skewed joins
- Merge join / The Merge join
- usage / The usage of specialized joins
- speculative execution
- about / Speculative execution of tasks
- split-brain scenario
- splits
- and compressions / Splits and compressions
- spout
- SQL
- about / Pig versus SQL
- versus Pig / Pig versus SQL
- SSL
- about / HTTPS and encrypted shuffle
- ssl-client.xml file, properties
- ssl.client.keystore.type / Configuring the keystore and truststore
- ssl.client.keystore.location / Configuring the keystore and truststore
- ssl.client.keystore.password / Configuring the keystore and truststore
- ssl.client.truststore.type / Configuring the keystore and truststore
- ssl.client.truststore.location / Configuring the keystore and truststore
- ssl.client.truststore.password / Configuring the keystore and truststore
- ssl.client.truststore.reload.interval / Configuring the keystore and truststore
- ssl-server.xml file, properties
- ssl.server.keystore.type / Configuring the keystore and truststore
- ssl.server.keystore.location / Configuring the keystore and truststore
- ssl.server.keystore.password / Configuring the keystore and truststore
- ssl.server.truststore.type / Configuring the keystore and truststore
- ssl.server.truststore.location / Configuring the keystore and truststore
- ssl.server.truststore.password / Configuring the keystore and truststore
- ssl.server.truststore.reload.interval / Configuring the keystore and truststore
- SSL configuration
- modifying / SSL configuration changes
- stragglers
- about / Speculative execution of tasks
- stemming / Tf-Idf in Pig
- storage layer enhancements, Hadoop 2.X
- high availability / High availability
- HDFS Federation / HDFS Federation
- HDFS snapshots / HDFS snapshots
- store functions
- about / The store functions
- Storm
- about / Developing YARN applications
- Storm on YARN
- building / Storm on YARN
- installing / Installing Apache Storm-on-YARN
- prerequisites / Prerequisites
- installation procedure / Installation procedure
- Stream
- streaming computation models / Batch processing versus streaming
- stream processing
- diagrammatic representation / Batch processing versus streaming
- STRUCTS, complex types / Data types
- supervised learning
- about / Machine learning
- supervisor command
- about / Installation procedure
- supporting components, Hive / The supporting components of Hive
T
- table
- about / The data model
- term frequency / Term frequency
- terminate() method / UDF, UDAF, and UDTF
- terminatePartial() method / UDF, UDAF, and UDTF
- Tf-idf
- about / Document analysis using Hadoop and Mahout, Term frequency – inverse document frequency
- calculating, in Pig / Tf-Idf in Pig
- three-layer network topology
- versus four-layer network topology / Three-layer versus four-layer network topology
- Thrift / Comparison – Avro versus Protocol Buffers / Thrift
- versus Avro / Comparison – Avro versus Protocol Buffers / Thrift
- throughput
- about / Performance
- Ticket Granting Server (TGS)
- Ticket Granting Ticket (TGT)
- timeline, Hadoop
- about / Hadoop's timeline
- topologies
- topology
- training data / Machine learning
- truststore
- configuring / Configuring the keystore and truststore
- about / Configuring the keystore and truststore
- Tuple data type, Pig / Complex data types in Pig
U
- UDAF
- about / UDF, UDAF, and UDTF
- UDAFs
- about / UDF, UDAF, and UDTF
- UDF
- about / UDF, UDAF, and UDTF
- Regular UDFs / UDF, UDAF, and UDTF
- UDAFs / UDF, UDAF, and UDTF
- UDTF / UDF, UDAF, and UDTF
- UDTF
- about / UDF, UDAF, and UDTF
- ui command
- about / Installation procedure
- UNION operator / The UNION operator
- UNIONS, complex types / Data types
- unsupervised learning
- about / Machine learning
- use cases, Apache Mahout
- classification / Apache Mahout
- clustering / Apache Mahout
- collaborative filtering / Apache Mahout
- frequent itemset mining / Apache Mahout
- use cases, Apache Storm
- about / Use cases for Apache Storm
- algorithmic trading in stock markets / Use cases for Apache Storm
- analytics from social network feeds / Use cases for Apache Storm
- smart advertising / Use cases for Apache Storm
- location-based applications / Use cases for Apache Storm
- sensor network-based applications / Use cases for Apache Storm
- useful tools, HDFS
- about / Useful HDFS tools
- rebalancer / Useful HDFS tools
- fsck / Useful HDFS tools
- import checkpoint / Useful HDFS tools
- User-defined Aggregate Functions (UDAFs) / The supporting components of Hive
- user-defined functions (UDFs)
- about / User-defined functions
- evaluation functions / The evaluation functions
- load functions / The load functions
- store functions / The store functions
- User-defined Functions (UDFs) / The supporting components of Hive
- user commands
- about / YARN commands, User commands
- jar command / User commands
- application command / User commands
- node command / User commands
- logs command / User commands
- User Element, cluster
- maxRunningApps / FairScheduler
V
- vertical scaling
- about / Scalability
- Virtual Private Cloud (VPC)
W
- Worker node, Apache Storm
- World Wide Web (WWW)
- about / The inception of Hadoop
- WritableComparable interface
- using / Writable and WritableComparable
- Writable interface
- using / Writable and WritableComparable
Y
- YARN
- about / Yet Another Resource Negotiator (YARN)
- architecture / Architecture overview
- monitoring / Monitoring YARN
- job scheduling / Job scheduling in YARN
- yarn.scheduler.capacity.<queue-path>.acl_administer_queue property / CapacityScheduler
- yarn.scheduler.capacity.<queue-path>.acl_submit_applications property / CapacityScheduler
- yarn.scheduler.capacity.<queue-path>.capacity property / CapacityScheduler
- yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent property / CapacityScheduler
- yarn.scheduler.capacity.<queue-path>.maximum-applications property / CapacityScheduler
- yarn.scheduler.capacity.<queue-path>.maximum-capacity property / CapacityScheduler
- yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent property / CapacityScheduler
- yarn.scheduler.capacity.<queue-path>.state property / CapacityScheduler
- yarn.scheduler.capacity.<queue-path>.user-limit-factor property / CapacityScheduler
- yarn.scheduler.capacity.maximum-am-resource-percent property / CapacityScheduler
- yarn.scheduler.capacity.maximum-applications property / CapacityScheduler
- yarn.scheduler.capacity.root.queues property / CapacityScheduler
- yarn.scheduler.fair.allocation.file property / FairScheduler
- yarn.scheduler.fair.allow-undeclared-pools property / FairScheduler
- yarn.scheduler.fair.locality.threshold.node property / FairScheduler
- yarn.scheduler.fair.locality.threshold.rack property / FairScheduler
- yarn.scheduler.fair.sizebasedweight property / FairScheduler
- yarn.scheduler.fair.use-as-default-queue property / FairScheduler
- YARN applications
- developing / Developing YARN applications
- YARN clients, writing / Writing YARN clients
- ApplicationMaster entity, writing / Writing the Application Master entity
- YARN architecture
- about / The YARN architecture
- Resource Manager (RM) / The YARN architecture, Resource Manager (RM)
- Node Manager (NM) / The YARN architecture, Node Manager (NM)
- Application Master (AM) / The YARN architecture, Application Master (AM)
- container / The YARN architecture
- client / The YARN architecture
- YARN clients / YARN clients
- YARN clients
- about / YARN clients
- writing / Writing YARN clients
- YARN commands
- about / YARN commands
- user commands / YARN commands, User commands
- administration commands / YARN commands
- yarn rmadmin command / CapacityScheduler