Hadoop Essentials

By: Shiva Achari

Overview of this book

This book introduces Hadoop and its tools, and shows you how to use them effectively to improve the way you handle Big Data. Starting with the fundamentals of Hadoop (YARN, MapReduce, HDFS, and other vital elements of the Hadoop ecosystem), you will move on to topics such as MapReduce patterns, data management, and real-time data analysis using Hadoop. You will also explore a number of the leading data processing tools, including Hive and Pig, and learn how to use Sqoop and Flume, two widely used technologies for data ingestion. With further guidance on data streaming and real-time analytics with Storm and Spark, Hadoop Essentials is a relevant resource for anyone who understands the difficulties, and the opportunities, presented by Big Data today. With this guide, you'll build confidence with Hadoop and be able to apply the knowledge and skills you learn to put its capabilities to work.
Table of Contents (15 chapters)
Hadoop Essentials
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Preface
3. Pillars of Hadoop – HDFS, MapReduce, and YARN
Index

Index

A

  • ACID properties
    • about / ACID properties
    • atomicity / ACID properties
    • durability / ACID properties
    • consistency / ACID properties
  • action operations
    • about / Actions
    • reduce (func) / Actions
    • collect () / Actions
    • count () / Actions
    • first () / Actions
    • take (n) / Actions
    • takeSample (withReplacement, num, seed) / Actions
    • saveAsTextFile (path) / Actions
    • saveAsSequenceFile (path) / Actions
    • countByKey () / Actions
    • foreach (func) / Actions
  • Alter table command
    • about / DDL (Data Definition Language) operations
  • Amazon Elastic MapReduce (EMR) / Hadoop distributions
  • Ambari
    • about / Apache Ambari
  • analytic database
    • about / Analytical database
  • Apache Flume
    • about / Apache Flume
    • reliability / Reliability
  • Apache Hadoop
    • about / Apache Hadoop
    • URL / Apache Hadoop
  • Apache Hadoop, modules
    • Hadoop common / Apache Hadoop
    • Hadoop distributed file system (HDFS) / Apache Hadoop
    • Hadoop YARN / Apache Hadoop
    • Hadoop MapReduce / Apache Hadoop
  • Apple Orange Mango
    • about / The MapReduce example
  • architecture, HBase
    • about / The Architecture of HBase
    • MasterServer / MasterServer
    • RegionServer / RegionServer
  • architecture, HDFS
    • NameNode / NameNode
    • DataNode / DataNode
    • Checkpoint NameNode / Checkpoint NameNode or Secondary NameNode
    • Secondary NameNode / Checkpoint NameNode or Secondary NameNode
    • BackupNode / BackupNode
  • architecture, Hive
    • Metastore / Metastore
    • query compiler / The Query compiler
    • execution engine / The Execution engine
  • architecture, MapReduce
    • JobTracker / JobTracker
    • TaskTracker / TaskTracker
  • architecture, Pig
    • about / The Pig architecture
    • logical plan / The logical plan
    • physical plan / The physical plan
    • MapReduce plan / The MapReduce plan
  • architecture, YARN
    • ResourceManager / ResourceManager
    • NodeManager / NodeManager
    • ApplicationMaster / ApplicationMaster
  • auto splitting / Auto Splitting
  • auxiliary steps
    • about / Auxiliary steps
    • Combiner / Combiner
    • Partitioner / Partitioner

B

  • basic data flow, Hadoop
    • about / Hadoop's basic data flow
  • big data
    • about / Understanding big data
    • sources / Who is creating big data?
    • use cases / Big data use cases
  • big data, use case patterns
    • about / Big data use case patterns
    • storage pattern / Big data as a storage pattern
    • data transformation pattern / Big data as a data transformation pattern
    • data analysis pattern / Big data for a data analysis pattern
    • data in real-time pattern / Big data for data in a real-time pattern
    • low latency caching pattern / Big data for a low latency caching pattern
  • BlockCache
    • about / BlockCache
    • LRUBlockCache / LRUBlockCache
    • SlabCache / SlabCache
    • BucketCache / BucketCache
  • bolts / Bolts
  • bucketing / Bucketing

C

  • CAP theorem / The CAP theorem
  • channels
    • about / Channels
    • In-Memory Queues / Channels
    • Disk-based Queues / Channels
    • Memory channel / Memory channel
    • File channel / File Channel
    • JDBC channel / JDBC Channel
  • Cloudera
    • about / Hadoop distributions
  • column store / Types of NoSQL databases
  • commands
    • about / Commands
    • help / Commands
    • create / Create
    • list / List
    • put / Put
    • scan / Scan
    • get / Get
    • disable / Disable
    • drop / Drop
  • compaction policy
    • about / Compaction, The Compaction policy
  • compactions
    • about / Compaction
    • compaction policy / The Compaction policy
    • minor compaction / Minor compaction
    • major compaction / Major compaction
  • complex data types
    • STRUCT / Data types and schemas
    • MAP / Data types and schemas
    • ARRAY / Data types and schemas
    • UNION / Data types and schemas
  • components, Agent
    • about / Components in Agent
    • source / Source
    • sink / Sink
  • components, data model
    • Tables / Logical components of a data model
    • Rows / Logical components of a data model
    • Column Families/Columns / Logical components of a data model
    • Version/Timestamp / Logical components of a data model
    • cell / Logical components of a data model
  • compression types
    • GZip / Compression
    • LZO / Compression
    • Snappy / Compression
  • connectors
    • about / Connectors and drivers
  • counters
    • about / Counters
    • single counter / Counters
    • multiple counter / Counters
  • Create table command
    • about / DDL (Data Definition Language) operations
  • custom SerDe class
    • writing / SerDe
  • Custom UDF
    • performing / Custom UDF (User Defined Functions)

D

  • DAG engine / Directed Acyclic Graph engine
  • data access component
    • Hive / Data access components
    • Pig / Data access components
  • Data Access components
    • about / Need of a data processing tool on Hadoop
    • Pig / Need of a data processing tool on Hadoop
    • Hive / Need of a data processing tool on Hadoop
  • data analysis pattern, big data / Big data for a data analysis pattern
  • data analytics
    • about / Data analytics and machine learning
  • data architecture, Storm
    • Spout / Data architecture of Storm
    • Bolt / Data architecture of Storm
    • Topology / Data architecture of Storm
    • Tuple / Data architecture of Storm
    • Stream / Data architecture of Storm
  • database trend
    • about / Database trend
  • data ingestion
    • challenges / Challenges in data ingestion
  • data ingestion, Hadoop
    • Sqoop / Data ingestion in Hadoop, Data ingestion
    • Flume / Data ingestion in Hadoop, Data ingestion
    • about / Data ingestion
    • Storm / Data ingestion
  • data in real-time pattern, big data / Big data for data in a real-time pattern
  • data processing tool
    • on Hadoop / Need of a data processing tool on Hadoop
  • data sources
    • about / Data sources
    • data sensors / Data sources
    • Machine Data / Data sources
    • Telco Data / Data sources
    • Healthcare system data / Data sources
    • Social Media / Data sources
    • Geological Data / Data sources
    • maps / Data sources
    • aerospace / Data sources
    • astronomy / Data sources
    • Mobile Data / Data sources
  • data storage, HDFS
    • about / Data storage in HDFS
    • parameters / Data storage in HDFS
    • blocks / Data storage in HDFS
    • replication / Data storage in HDFS
    • read pipeline / Read pipeline
    • write pipeline / Write pipeline
  • data storage component
    • HBase / Data storage component
  • data transformation pattern, big data / Big data as a data transformation pattern
  • data types, Pig
    • primitive / Pig data types
    • map / Pig data types
    • tuple / Pig data types
    • bag / Pig data types
  • DDL operations / DDL (Data Definition Language) operations
  • deployment modes, Hadoop
    • standalone / Apache Hadoop
    • pseudo distributed / Apache Hadoop
    • distributed / Apache Hadoop
  • describe table command
    • about / DDL (Data Definition Language) operations
  • Directed Acyclic Graph (DAG) pattern
    • about / An introduction to Spark
  • Disk-based Queues / Channels
  • distributed filesystem
    • about / Distributed filesystem
    • HDFS / HDFS
  • distributed programming
    • about / Distributed programming
  • DML operations / DML (Data Manipulation Language) operations
  • document database / Types of NoSQL databases
  • drivers
    • about / Connectors and drivers
  • drop table command
    • about / DDL (Data Definition Language) operations

E

  • Enterprise Data Warehouse (EDW)
    • about / Big data use cases
  • Enterprise Resource Planning (ERPs) / Big data for data in a real-time pattern
  • execution, Pig
    • modes / Pig modes
  • execution engine / The Execution engine
  • exports
    • about / Exports
  • external table
    • advantages / Managing tables – external versus managed

F

  • File channel
    • about / File Channel
    • properties / File Channel
  • FileFormats
    • about / FileFormats
    • InputFormats / InputFormats
    • RecordReader / RecordReader
    • OutputFormats / OutputFormats
    • RecordWriter / RecordWriter
  • filters
    • about / Filters
    • Column Value / Filters
    • SingleColumnValueFilter / Filters
    • ColumnRangeFilter / Filters
    • KeyValue / Filters
    • FamilyFilter / Filters
    • QualifierFilter / Filters
    • RowKey / Filters
    • RowFilter / Filters
    • Multiple Filters / Filters
  • Flume
    • about / Data ingestion in Hadoop, Data ingestion, Spark streaming
    • Events / Flume nodes
    • Agent / Flume nodes
  • Flume architecture
    • about / Flume architecture
    • multitier topology / Multitier topology
  • Flume configuration
    • examples / Examples of configuring Flume, The Single agent example, Configuring a multiagent setup
    • single agent example / The Single agent example
    • multiple flows, in agent / Multiple flows in an agent
    • multi-agent setup, configuring / Configuring a multiagent setup
  • Flume Master / Flume master
  • Flume Nodes / Flume nodes
  • frameworks, distributed programming
    • Hive / Distributed programming
    • Pig / Distributed programming
    • Spark / Distributed programming

G

  • graph database / Types of NoSQL databases
  • GraphX / GraphX
  • groupWith
    • about / Transformations
  • Grunt shell
    • about / Grunt shell
    • input data / Input data
    • data, loading / Loading data
    • dump command / Dump
    • store command / Store
    • filter / Filter
    • Group By command / Group By
    • Limit command / Limit
    • aggregation functions / Aggregation
    • Cogroup / Cogroup
    • DESCRIBE command / DESCRIBE
    • EXPLAIN command / EXPLAIN
    • ILLUSTRATE command / ILLUSTRATE

H

  • Hadoop
    • about / Hadoop
    • history / Hadoop history, Description
    • advantages / Advantages of Hadoop
    • examples, of use cases / Uses of Hadoop
    • use cases / The Hadoop use cases
    • basic data flow / Hadoop's basic data flow
  • Hadoop common
    • about / Apache Hadoop
  • Hadoop distributed file system (HDFS)
    • about / Apache Hadoop
  • Hadoop distributions
    • about / Hadoop distributions
    • Cloudera / Hadoop distributions
    • Hortonworks / Hadoop distributions
    • MapR / Hadoop distributions
    • Amazon Elastic MapReduce (EMR) / Hadoop distributions
  • Hadoop ecosystem
    • about / Hadoop ecosystem, The Hadoop ecosystem
  • Hadoop integration
    • about / Hadoop integration
  • Hadoop MapReduce
    • about / Apache Hadoop
  • Hadoop YARN
    • about / Apache Hadoop
  • HBase
    • about / Data storage component, Apache HBase, An Overview of HBase
    • advantages / Advantages of HBase
  • HBase co-processors
    • about / HBase coprocessors
    • Observer / HBase coprocessors
    • Endpoint / HBase coprocessors
  • HBase data model
    • about / The HBase data model
    • logical components / Logical components of a data model
    • ACID properties / ACID properties
    • CAP theorem / The CAP theorem
  • HBase Hive integration
    • about / HBase Hive integration
    • EXTERNAL / HBase Hive integration
    • STORED BY / HBase Hive integration
    • SERDEPROPERTIES / HBase Hive integration
    • TBLPROPERTIES / HBase Hive integration
  • HDFS
    • about / Pillars of Hadoop, HDFS, HDFS, Spark streaming
    • features / Features of HDFS
    • architecture / HDFS architecture
    • data storage / Data storage in HDFS
    • rack awareness, configuring / Rack awareness
    • Federation / HDFS federation
    • ports / HDFS ports
    • commands / HDFS commands
  • HDFS 1.0
    • limitations / Limitations of HDFS 1.0
  • HDFS Federation
    • benefits / The benefit of HDFS federation
  • HDFS web UI ports
    • URL / HDFS ports
  • Hive
    • about / Data access components, Distributed programming, Hive
    • architecture / The Hive architecture
    • data types / Data types and schemas
    • schemas / Data types and schemas
    • installing / Installing Hive
    • Shell, starting / Starting Hive shell
    • QL / HiveQL
    • tables, managing / Managing tables – external versus managed
    • SerDe / SerDe
    • partitioning / Partitioning
    • bucketing / Bucketing
  • HiveQL / Distributed programming
    • process flow / The Hive architecture
    • about / HiveQL
    • DDL operations / DDL (Data Definition Language) operations
    • DML operations / DML (Data Manipulation Language) operations
    • SQL operation / The SQL operation
    • built-in functions / Built-in functions
    • Custom UDF / Custom UDF (User Defined Functions)
  • Hortonworks
    • about / Hadoop distributions

I

  • imports
    • about / Imports
  • In-Memory Queues / Channels
  • International Data Corporation (IDC)
    • about / Volume

J

  • JDBC channel
    • about / JDBC Channel
    • properties / JDBC Channel

K

  • Kafka
    • about / Spark streaming
  • key-value store / Types of NoSQL databases
  • Kinesis
    • about / Spark streaming

L

  • low latency caching pattern, big data / Big data for a low latency caching pattern

M

  • machine learning
    • about / Data analytics and machine learning
  • Mahout
    • about / Data analytics and machine learning
  • major compaction
    • about / Major compaction
    • hbase.hregion.majorcompaction / Major compaction
    • hbase.hregion.majorcompaction.jitter / Major compaction
  • Mapper
    • about / The MapReduce example
  • MapR
    • about / Hadoop distributions
  • MapReduce
    • about / Pillars of Hadoop, Data access components, MapReduce
    • architecture / The MapReduce architecture
    • serialization data types / Serialization data types
    • example / The MapReduce example
    • process / The MapReduce process
    • Mapper / Mapper
    • shuffle and sorting / Shuffle and sorting
    • Reducer / Reducer
    • speculative execution / Speculative execution
    • FileFormats / FileFormats
    • program, writing / Writing a MapReduce program
    • auxiliary steps / Auxiliary steps
  • MapReduce program
    • writing / Writing a MapReduce program
    • Mapper code / Mapper code
    • Reducer code / Reducer code
    • Driver code / Driver code
  • MasterServer / MasterServer
  • Memory channel
    • about / Memory channel
    • properties / Memory channel
  • Metastore / Metastore
  • minor compaction
    • about / Minor compaction
    • hbase.hstore.compaction.ratio / Minor compaction
    • hbase.hstore.compaction.min.size / Minor compaction
    • hbase.hstore.compaction.max.size / Minor compaction
    • hbase.hstore.compaction.min / Minor compaction
  • MLlib / MLlib
  • modes, Pig
    • Local Mode / Pig modes
    • MapReduce Mode / Pig modes
  • multi-agent setup
    • configuring / Configuring a multiagent setup
  • multiple counter / Counters
  • multitier topology
    • about / Multitier topology
    • Flume Master / Flume master
    • Flume Nodes / Flume nodes

N

  • NameNode
    • Fsimage file / NameNode
    • Editlog file / NameNode
  • NoSQL database
    • about / NoSQL
  • NoSQL database, types
    • key-value store / Types of NoSQL databases
    • column store / Types of NoSQL databases
    • document database / Types of NoSQL databases
    • graph database / Types of NoSQL databases
  • Nutch
    • about / Hadoop history

O

  • Observer types
    • RegionObserver / HBase coprocessors
    • MasterObserver / HBase coprocessors
    • WALObserver / HBase coprocessors

P

  • Partitioner, auxiliary steps
    • custom partitioner / Custom partitioner
  • partitioning
    • about / Partitioning
  • performance tuning
    • about / Performance tuning
    • compression / Compression
    • filters / Filters
    • counters / Counters
    • co-processors / HBase coprocessors
  • physical architecture / Physical architecture
  • physical architecture, Storm
    • Nimbus / Physical architecture of Storm
    • Supervisor / Physical architecture of Storm
    • Worker / Physical architecture of Storm
    • Zookeeper / Physical architecture of Storm
  • Pig
    • about / Data access components, Distributed programming, Pig
    • data types / Pig data types
    • architecture / The Pig architecture
    • modes / Pig modes
    • Grunt shell / Grunt shell
  • pipeline
    • writing / The Write pipeline
    • reading / The Read pipeline
  • pre-splitting / Pre-Splitting

Q

  • query compiler / The Query compiler

R

  • rack awareness
    • configuring / Rack awareness
    • advantages / Advantages of rack awareness in HDFS
  • RDD
    • about / Resilient Distributed Dataset
    • parallelized collections / Resilient Distributed Dataset
    • Hadoop datasets / Resilient Distributed Dataset
    • narrow dependencies / Resilient Distributed Dataset
    • wide dependencies / Resilient Distributed Dataset
    • features / Resilient Distributed Dataset
  • real-time analysis
    • about / Streaming and real-time analysis
  • Reducer
    • about / The MapReduce example
  • RegionServer
    • about / RegionServer
    • WAL / WAL
    • BlockCache / BlockCache
    • regions / Regions
    • MemStore / MemStore
    • Zookeeper / Zookeeper
  • reliability, Apache Flume
    • end-to-end level / Reliability
    • store on failure level / Reliability
    • best effort level / Reliability
  • Resilient Distributed Dataset (RDD)
    • about / Spark architecture

S

  • S3
    • about / Spark streaming
  • scheduling
    • about / Scheduling
  • schema design
    • about / The Schema design
  • SerDe / SerDe
  • serialization data types, MapReduce
    • Writable interface / The Writable interface
    • WritableComparable interface / WritableComparable interface
  • service programming tools
    • about / Service programming
    • YARN / Apache YARN
  • Show tables command
    • about / DDL (Data Definition Language) operations
  • single counter / Counters
  • sink types
    • about / Sink
  • source types
    • about / Source
    • URL / Source
  • Spark
    • about / Streaming and real-time analysis, Distributed programming, An introduction to Spark
    • features / Features of Spark
    • operations / Operations in Spark
    • transformation operation / Transformations
    • action operations / Actions
    • example / Spark example
  • Spark Apache docs
    • URL / Transformations, Actions
  • Spark architecture
    • about / Spark architecture
    • DAG engine / Directed Acyclic Graph engine
    • RDD / Resilient Distributed Dataset
    • physical architecture / Physical architecture
  • Spark framework
    • about / Spark framework
    • Spark SQL / Spark SQL
    • GraphX / GraphX
    • MLlib / MLlib
    • Spark streaming / Spark streaming
  • Spark SQL / Spark SQL
  • Spark streaming / Spark streaming
  • speculative execution / Speculative execution
  • splitting
    • about / Splitting
    • pre-splitting / Pre-Splitting
    • auto splitting / Auto Splitting
    • forced splitting / Forced Splitting
  • SPOF (Single Point of Failure)
    • about / HDFS federation
  • spouts / Spouts
  • SQL operation
    • about / The SQL operation
    • SELECT / The SQL operation
    • joins / Joins
    • aggregations / Aggregations
  • Sqoop
    • about / Data ingestion in Hadoop, Data ingestion, Sqoop
  • Sqoop 1
    • architecture / Sqoop 1 architecture
    • limitations / Limitation of Sqoop 1
  • Sqoop 2
    • architecture / Sqoop 2 architecture
  • storage pattern, big data / Big data as a storage pattern
  • store command
    • about / Store
    • FOREACH generate / FOREACH generate
  • Storm
    • about / Streaming and real-time analysis, Data ingestion, An introduction to Storm
    • features / Features of Storm
    • physical architecture / Physical architecture of Storm
    • data architecture / Data architecture of Storm
    • topology / Storm topology
    • integration, on YARN / Storm on YARN
  • streaming
    • about / Streaming and real-time analysis
  • system management
    • about / System management

T

  • tables
    • managing / Managing tables – external versus managed
  • topology, Storm
    • shuffle grouping / Storm topology
    • fields grouping / Storm topology
    • all grouping / Storm topology
    • global grouping / Storm topology
    • direct grouping / Storm topology
  • topology configuration example
    • about / Topology configuration example
    • spouts / Spouts
    • bolts / Bolts
    • topology / Topology
  • traditional systems
    • about / Traditional systems
    • steps / Traditional systems
  • transformation operation
    • about / Transformations
    • map (func) / Transformations
    • filter (func) / Transformations
    • flatMap (func) / Transformations
    • mapPartitions (func) / Transformations
    • mapPartitionsWithSplit (func) / Transformations
    • sample (withReplacement, fraction, seed) / Transformations
    • union (otherDataset) / Transformations
    • distinct ([numTasks]) / Transformations
    • groupByKey ([numTasks]) / Transformations
    • reduceByKey (func, [numTasks]) / Transformations
    • sortByKey ([ascending], [numTasks]) / Transformations
    • join (otherDataset, [numTasks]) / Transformations
    • cogroup (otherDataset, [numTasks]) / Transformations
    • cartesian (otherDataset) / Transformations
  • Twitter
    • about / Spark streaming

U

  • use cases, Hadoop
    • about / The Hadoop use cases
  • User Defined Functions (UDF)
    • about / Distributed programming

V

  • V's, of big data
    • about / V's of big data
    • volume / Volume
    • velocity / Velocity
    • variety / Variety

W

  • WORM (write once, read many)
    • about / Features of HDFS
  • Write Ahead Log (WAL)
    • about / Reliability

Y

  • YARN
    • about / Pillars of Hadoop, Apache YARN, YARN
    • architecture / YARN architecture
    • applications / Applications powered by YARN