Index
A
- ACID properties
- about / ACID properties
- atomicity / ACID properties
- durability / ACID properties
- consistency / ACID properties
- action operations
- Alter table command
- Amazon Elastic MapReduce (EMR) / Hadoop distributions
- Ambari
- about / Apache Ambari
- analytic database
- about / Analytical database
- Apache Flume
- about / Apache Flume
- reliability / Reliability
- Apache Hadoop
- about / Apache Hadoop
- URL / Apache Hadoop
- Apache Hadoop, modules
- Hadoop common / Apache Hadoop
- Hadoop distributed file system (HDFS) / Apache Hadoop
- Hadoop YARN / Apache Hadoop
- Hadoop MapReduce / Apache Hadoop
- Apple Orange Mango
- about / The MapReduce example
- architecture, HBase
- about / The Architecture of HBase
- MasterServer / MasterServer
- RegionServer / RegionServer
- architecture, HDFS
- NameNode / NameNode
- DataNode / DataNode
- Checkpoint NameNode / Checkpoint NameNode or Secondary NameNode
- Secondary NameNode / Checkpoint NameNode or Secondary NameNode
- BackupNode / BackupNode
- architecture, Hive
- Metastore / Metastore
- query compiler / The Query compiler
- execution engine / The Execution engine
- architecture, MapReduce
- JobTracker / JobTracker
- TaskTracker / TaskTracker
- architecture, Pig
- about / The Pig architecture
- logical plan / The logical plan
- physical plan / The physical plan
- MapReduce plan / The MapReduce plan
- architecture, YARN
- ResourceManager / ResourceManager
- NodeManager / NodeManager
- ApplicationMaster / ApplicationMaster
- auto splitting / Auto Splitting
- auxiliary steps
- about / Auxiliary steps
- Combiner / Combiner
- Partitioner / Partitioner
B
- basic data flow, Hadoop
- about / Hadoop's basic data flow
- big data
- about / Understanding big data
- sources / Who is creating big data?
- use cases / Big data use cases
- big data, use case patterns
- about / Big data use case patterns
- storage pattern / Big data as a storage pattern
- data transformation pattern / Big data as a data transformation pattern
- data analysis pattern / Big data for a data analysis pattern
- data in real-time pattern / Big data for data in a real-time pattern
- low latency caching pattern / Big data for a low latency caching pattern
- BlockCache
- about / BlockCache
- LRUBlockCache / LRUBlockCache
- SlabCache / SlabCache
- BucketCache / BucketCache
- bolts / Bolts
- bucketing / Bucketing
C
- CAP theorem / The CAP theorem
- channels
- about / Channels
- In-Memory Queues / Channels
- Disk-based Queues / Channels
- Memory channel / Memory channel
- File channel / File Channel
- JDBC channel / JDBC Channel
- Cloudera
- about / Hadoop distributions
- column store / Types of NoSQL databases
- commands
- compaction policy
- about / Compaction, The Compaction policy
- compactions
- about / Compaction
- compaction policy / The Compaction policy
- minor compaction / Minor compaction
- major compaction / Major compaction
- complex data types
- STRUCT / Data types and schemas
- MAP / Data types and schemas
- ARRAY / Data types and schemas
- UNION / Data types and schemas
- components, Agent
- about / Components in Agent
- source / Source
- sink / Sink
- components, data model
- Tables / Logical components of a data model
- Rows / Logical components of a data model
- Column Families/Columns / Logical components of a data model
- Version/Timestamp / Logical components of a data model
- cell / Logical components of a data model
- compression types
- GZip / Compression
- LZO / Compression
- Snappy / Compression
- connectors
- about / Connectors and drivers
- counters
- Create table command
- custom SerDe class
- writing / SerDe
- Custom UDF
- performing / Custom UDF (User Defined Functions)
D
- DAG engine / Directed Acyclic Graph engine
- data access component
- Hive / Data access components
- Pig / Data access components
- Data Access components
- data analysis pattern, big data / Big data for a data analysis pattern
- data analytics
- data architecture, Storm
- Spout / Data architecture of Storm
- Bolt / Data architecture of Storm
- Topology / Data architecture of Storm
- Tuple / Data architecture of Storm
- Stream / Data architecture of Storm
- database trend
- about / Database trend
- data ingestion
- challenges / Challenges in data ingestion
- data ingestion, Hadoop
- Sqoop / Data ingestion in Hadoop, Data ingestion
- Flume / Data ingestion in Hadoop, Data ingestion
- about / Data ingestion
- Storm / Data ingestion
- data in real-time pattern, big data / Big data for data in a real-time pattern
- data processing tool
- on Hadoop / Need of a data processing tool on Hadoop
- data sources
- about / Data sources
- data sensors / Data sources
- Machine Data / Data sources
- Telco Data / Data sources
- Healthcare system data / Data sources
- Social Media / Data sources
- Geological Data / Data sources
- maps / Data sources
- aerospace / Data sources
- astronomy / Data sources
- Mobile Data / Data sources
- data storage, HDFS
- about / Data storage in HDFS
- parameters / Data storage in HDFS
- blocks / Data storage in HDFS
- replication / Data storage in HDFS
- read pipeline / Read pipeline
- write pipeline / Write pipeline
- data storage component
- HBase / Data storage component
- data transformation pattern, big data / Big data as a data transformation pattern
- data types, Pig
- primitive / Pig data types
- map / Pig data types
- tuple / Pig data types
- bag / Pig data types
- DDL operations / DDL (Data Definition Language) operations
- deployment modes, Hadoop
- standalone / Apache Hadoop
- pseudo distributed / Apache Hadoop
- distributed / Apache Hadoop
- describe table command
- Directed Acyclic Graph (DAG) pattern
- about / An introduction to Spark
- Disk-based Queues / Channels
- distributed filesystem
- about / Distributed filesystem
- HDFS / HDFS
- distributed programming
- about / Distributed programming
- DML operations / DML (Data Manipulation Language) operations
- document database / Types of NoSQL databases
- drivers
- about / Connectors and drivers
- drop table command
E
- Enterprise Data Warehouse (EDW)
- about / Big data use cases
- Enterprise Resource Planning (ERPs) / Big data for data in a real-time pattern
- execution, Pig
- modes / Pig modes
- execution engine / The Execution engine
- exports
- about / Exports
- external table
- advantages / Managing tables – external versus managed
F
- File channel
- about / File Channel
- properties / File Channel
- FileFormats
- about / FileFormats
- InputFormats / InputFormats
- RecordReader / RecordReader
- OutputFormats / OutputFormats
- RecordWriter / RecordWriter
- filters
- Flume
- about / Data ingestion in Hadoop, Data ingestion, Spark streaming
- Events / Flume nodes
- Agent / Flume nodes
- Flume architecture
- about / Flume architecture
- multitier topology / Multitier topology
- Flume configuration
- examples / Examples of configuring Flume, The Single agent example, Configuring a multiagent setup
- single agent example / The Single agent example
- multiple flows, in agent / Multiple flows in an agent
- multi-agent setup, configuring / Configuring a multiagent setup
- Flume Master / Flume master
- Flume Nodes / Flume nodes
- frameworks, distributed programming
- Hive / Distributed programming
- Pig / Distributed programming
- Spark / Distributed programming
G
- graph database / Types of NoSQL databases
- GraphX / GraphX
- groupWith
- about / Transformations
- Grunt shell
- about / Grunt shell
- input data / Input data
- data, loading / Loading data
- dump command / Dump
- store command / Store
- filter / Filter
- Group By command / Group By
- Limit command / Limit
- aggregation functions / Aggregation
- Cogroup / Cogroup
- DESCRIBE command / DESCRIBE
- EXPLAIN command / EXPLAIN
- ILLUSTRATE command / ILLUSTRATE
H
- Hadoop
- about / Hadoop
- history / Hadoop history, Description
- advantages / Advantages of Hadoop
- examples, of use cases / Uses of Hadoop
- use cases / The Hadoop use cases
- basic data flow / Hadoop's basic data flow
- Hadoop common
- about / Apache Hadoop
- Hadoop distributed file system (HDFS)
- about / Apache Hadoop
- Hadoop distributions
- about / Hadoop distributions
- Cloudera / Hadoop distributions
- Hortonworks / Hadoop distributions
- MapR / Hadoop distributions
- Amazon Elastic MapReduce (EMR) / Hadoop distributions
- Hadoop ecosystem
- about / Hadoop ecosystem, The Hadoop ecosystem
- Hadoop integration
- about / Hadoop integration
- Hadoop MapReduce
- about / Apache Hadoop
- Hadoop YARN
- about / Apache Hadoop
- HBase
- about / Data storage component, Apache HBase, An Overview of HBase
- advantages / Advantages of HBase
- HBase co-processors
- about / HBase coprocessors
- Observer / HBase coprocessors
- Endpoint / HBase coprocessors
- HBase data model
- about / The HBase data model
- logical components / Logical components of a data model
- ACID properties / ACID properties
- CAP theorem / The CAP theorem
- HBase Hive integration
- about / HBase Hive integration
- EXTERNAL / HBase Hive integration
- STORED BY / HBase Hive integration
- SERDEPROPERTIES / HBase Hive integration
- TBLPROPERTIES / HBase Hive integration
- HDFS
- about / Pillars of Hadoop, HDFS, HDFS, Spark streaming
- features / Features of HDFS
- architecture / HDFS architecture
- data storage / Data storage in HDFS
- rack awareness, configuring / Rack awareness
- Federation / HDFS federation
- ports / HDFS ports
- commands / HDFS commands
- HDFS 1.0
- limitations / Limitations of HDFS 1.0
- HDFS Federation
- benefits / The benefit of HDFS federation
- HDFS web UI ports
- URL / HDFS ports
- Hive
- about / Data access components, Distributed programming, Hive
- architecture / The Hive architecture
- data types / Data types and schemas
- schemas / Data types and schemas
- installing / Installing Hive
- Shell, starting / Starting Hive shell
- QL / HiveQL
- tables, managing / Managing tables – external versus managed
- SerDe / SerDe
- partitioning / Partitioning
- bucketing / Bucketing
- HiveQL / Distributed programming
- process flow / The Hive architecture
- about / HiveQL
- DDL operations / DDL (Data Definition Language) operations
- DML operations / DML (Data Manipulation Language) operations
- SQL operation / The SQL operation
- built-in functions / Built-in functions
- Custom UDF / Custom UDF (User Defined Functions)
- Hortonworks
- about / Hadoop distributions
I
- imports
- about / Imports
- In-Memory Queues / Channels
- International Data Corporation (IDC)
- about / Volume
J
- JDBC channel
- about / JDBC Channel
- properties / JDBC Channel
K
- Kafka
- about / Spark streaming
- key-value store / Types of NoSQL databases
- Kinesis
- about / Spark streaming
L
- low latency caching pattern, big data / Big data for a low latency caching pattern
M
- machine learning
- Mahout
- major compaction
- about / Major compaction
- hbase.hregion.majorcompaction / Major compaction
- hbase.hregion.majorcompaction.jitter / Major compaction
- Mapper
- about / The MapReduce example
- MapR
- about / Hadoop distributions
- MapReduce
- about / Pillars of Hadoop, Data access components, MapReduce
- architecture / The MapReduce architecture
- serialization data types / Serialization data types
- example / The MapReduce example
- process / The MapReduce process
- Mapper / Mapper
- shuffle and sorting / Shuffle and sorting
- Reducer / Reducer
- speculative execution / Speculative execution
- FileFormats / FileFormats
- program, writing / Writing a MapReduce program
- auxiliary steps / Auxiliary steps
- MapReduce program
- writing / Writing a MapReduce program
- Mapper code / Mapper code
- Reducer code / Reducer code
- Driver code / Driver code
- MasterServer / MasterServer
- Memory channel
- about / Memory channel
- properties / Memory channel
- Metastore / Metastore
- minor compaction
- about / Minor compaction
- hbase.hstore.compaction.ratio / Minor compaction
- hbase.hstore.compaction.min.size / Minor compaction
- hbase.hstore.compaction.max.size / Minor compaction
- hbase.hstore.compaction.min / Minor compaction
- MLlib / MLlib
- modes, Pig
- multi-agent setup
- configuring / Configuring a multiagent setup
- multiple counter / Counters
- multitier topology
- about / Multitier topology
- Flume Master / Flume master
- Flume Nodes / Flume nodes
N
- NameNode
- NoSQL database
- about / NoSQL
- NoSQL database, types
- key-value store / Types of NoSQL databases
- column store / Types of NoSQL databases
- document database / Types of NoSQL databases
- graph database / Types of NoSQL databases
- Nutch
- about / Hadoop history
O
- Observer types
- RegionObserver / HBase coprocessors
- MasterObserver / HBase coprocessors
- WALObserver / HBase coprocessors
P
- Partitioner, auxiliary steps
- custom partitioner / Custom partitioner
- partitioning
- about / Partitioning
- performance tuning
- about / Performance tuning
- compression / Compression
- filters / Filters
- counters / Counters
- co-processors / HBase coprocessors
- physical architecture / Physical architecture
- physical architecture, Storm
- Nimbus / Physical architecture of Storm
- Supervisor / Physical architecture of Storm
- Worker / Physical architecture of Storm
- Zookeeper / Physical architecture of Storm
- Pig
- about / Data access components, Distributed programming, Pig
- data types / Pig data types
- architecture / The Pig architecture
- modes / Pig modes
- Grunt shell / Grunt shell
- pipeline
- writing / The Write pipeline
- reading / The Read pipeline
- pre-splitting / Pre-Splitting
Q
- query compiler / The Query compiler
R
- rack awareness
- configuring / Rack awareness
- advantages / Advantages of rack awareness in HDFS
- RDD
- about / Resilient Distributed Dataset
- parallelized collections / Resilient Distributed Dataset
- Hadoop datasets / Resilient Distributed Dataset
- narrow dependencies / Resilient Distributed Dataset
- wide dependencies / Resilient Distributed Dataset
- features / Resilient Distributed Dataset
- real-time analysis
- about / Streaming and real-time analysis
- Reducer
- about / The MapReduce example
- RegionServer
- about / RegionServer
- WAL / WAL
- BlockCache / BlockCache
- regions / Regions
- MemStore / MemStore
- Zookeeper / Zookeeper
- reliability, Apache Flume
- end-to-end level / Reliability
- store on failure level / Reliability
- best effort level / Reliability
- Resilient Distributed Dataset (RDD)
- about / Spark architecture
S
- S3
- about / Spark streaming
- scheduling
- about / Scheduling
- schema design
- about / The Schema design
- SerDe / SerDe
- serialization data types, MapReduce
- Writable interface / The Writable interface
- WritableComparable interface / WritableComparable interface
- service programming tools
- about / Service programming
- YARN / Apache YARN
- Show tables command
- single counter / Counters
- sink types
- about / Sink
- source types
- Spark
- about / Streaming and real-time analysis, Distributed programming, An introduction to Spark
- features / Features of Spark
- operations / Operations in Spark
- transformation operation / Transformations
- action operations / Actions
- example / Spark example
- Spark Apache docs
- URL / Transformations, Actions
- Spark architecture
- about / Spark architecture
- DAG engine / Directed Acyclic Graph engine
- RDD / Resilient Distributed Dataset
- physical architecture / Physical architecture
- Spark framework
- about / Spark framework
- Spark SQL / Spark SQL
- GraphX / GraphX
- MLlib / MLlib
- Spark streaming / Spark streaming
- Spark SQL / Spark SQL
- Spark streaming / Spark streaming
- speculative execution / Speculative execution
- splitting
- about / Splitting
- pre-splitting / Pre-Splitting
- auto splitting / Auto Splitting
- forced splitting / Forced Splitting
- SPOF (Single Point of Failure)
- about / HDFS federation
- spouts / Spouts
- SQL operation
- about / The SQL operation
- SELECT / The SQL operation
- joins / Joins
- aggregations / Aggregations
- Sqoop
- about / Data ingestion in Hadoop, Data ingestion, Sqoop
- Sqoop 1
- architecture / Sqoop 1 architecture
- limitations / Limitation of Sqoop 1
- Sqoop 2
- architecture / Sqoop 2 architecture
- storage pattern, big data / Big data as a storage pattern
- store command
- about / Store
- FOREACH generate / FOREACH generate
- Storm
- about / Streaming and real-time analysis, Data ingestion, An introduction to Storm
- features / Features of Storm
- physical architecture / Physical architecture of Storm
- data architecture / Data architecture of Storm
- topology / Storm topology
- integration, on YARN / Storm on YARN
- streaming
- about / Streaming and real-time analysis
- system management
- about / System management
T
- tables
- managing / Managing tables – external versus managed
- topology, Storm
- shuffle grouping / Storm topology
- fields grouping / Storm topology
- all grouping / Storm topology
- global grouping / Storm topology
- direct grouping / Storm topology
- topology configuration example
- about / Topology configuration example
- spouts / Spouts
- bolts / Bolts
- topology / Topology
- traditional systems
- about / Traditional systems
- steps / Traditional systems
- transformation operation
- about / Transformations
- map (func) / Transformations
- filter (func) / Transformations
- flatMap (func) / Transformations
- mapPartitions (func) / Transformations
- mapPartitionsWithSplit (func) / Transformations
- sample (withReplacement, fraction, seed) / Transformations
- union (otherDataset) / Transformations
- distinct ([numTasks]) / Transformations
- groupByKey ([numTasks]) / Transformations
- reduceByKey (func, [numTasks]) / Transformations
- sortByKey ([ascending], [numTasks]) / Transformations
- join (otherDataset, [numTasks]) / Transformations
- cogroup (otherDataset, [numTasks]) / Transformations
- cartesian (otherDataset) / Transformations
- Twitter
- about / Spark streaming
U
- use cases, Hadoop
- about / The Hadoop use cases
- User Defined Functions (UDF)
- about / Distributed programming
V
- V's, of big data
- about / V's of big data
- volume / Volume
- velocity / Velocity
- variety / Variety
W
- WORM (write once, read many)
- about / Features of HDFS
- Write Ahead Log (WAL)
- about / Reliability
Y
- YARN
- about / Pillars of Hadoop, Apache YARN, YARN
- architecture / YARN architecture
- applications / Applications powered by YARN