Index
A
- addnl
- about / How it works...
- Adwords assigner / There's more...
- Adwords balance algorithm
- used, for assigning advertisements to keywords / Assigning advertisements to keywords using the Adwords balance algorithm
- implementing / Assigning advertisements to keywords using the Adwords balance algorithm, Getting ready, How to do it...
- working / How it works...
- AdwordsBidGenerator / How it works...
- Amazon EC2 Spot Instances
- Amazon Elastic Compute Cloud (EC2) / Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)
- Amazon Elastic MapReduce (EMR)
- about / Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)
- used, for running MapReduce computations / Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR), How to do it...
- Amazon EMR console
- URL / How to do it...
- Amazon sales dataset
- clustering / Clustering an Amazon sales dataset, Getting ready
- working / How it works...
- Amazon Simple Storage Service (S3) / Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)
- ant-nodeps package
- about / Building libhdfs
- ant-trax package
- about / Building libhdfs
- Apache Ant
- download link / Getting ready
- URL / Getting ready
- Apache Forrest
- URL / Building libhdfs
- Apache Gora / Configuring Apache HBase as the backend data store for Apache Nutch
- Apache HBase
- configuring, as backend data store for Apache Nutch / Configuring Apache HBase as the backend data store for Apache Nutch, How to do it, How it works...
- deploying, on Hadoop cluster / Deploying Apache HBase on a Hadoop cluster, How to do it, How it works...
- download link / How to do it
- Apache HBase Cluster
- deploying, on Amazon EC2 cloud with EMR / Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR, How to do it...
- Apache Lucene project / Indexing and searching web documents using Apache Solr
- Apache Mahout K-Means clustering algorithm
- about / How to do it...
- Apache Nutch
- about / Intra-domain web crawling using Apache Nutch
- used, for intra-domain web crawling / Intra-domain web crawling using Apache Nutch, How to do it...
- Apache HBase, configuring as backend data store / Configuring Apache HBase as the backend data store for Apache Nutch, How to do it, How it works...
- using, with Hadoop/HBase cluster for web crawling / Getting ready, How to do it, How it works
- Apache Nutch Ant build / How it works
- Apache Nutch search engine
- about / Introduction
- Apache Solr
- about / Indexing and searching web documents using Apache Solr
- used, for indexing and searching web documents / Indexing and searching web documents using Apache Solr, How to do it
- working / How it works
- Apache Tomcat developer list e-mail archives
- URL / Introduction
- Apache Whirr
- about / Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
- used, for deploying Hadoop cluster on Amazon EC2 cloud / Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment, How to do it..., How it works...
- used, for deploying HBase cluster on Amazon EC2 cloud / Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment, How to do it..., How it works...
- Apache Whirr binary distribution
- downloading / How to do it...
- automake package
- about / Building libhdfs
- AWS Access Keys / How to do it...
B
- bad records
- benchmarks
- running, for verifying Hadoop installation / Running benchmarks to verify the Hadoop installation, How it works...
- about / Running benchmarks to verify the Hadoop installation
- built-in data types
- Text / There's more...
- BytesWritable / There's more...
- VIntWritable / There's more...
- VLongWritable / There's more...
- NullWritable / There's more...
- ArrayWritable / There's more...
- TwoDArrayWritable / There's more...
- MapWritable / There's more...
- SortedMapWritable / There's more...
C
- <configuration> tag
- about / How to do it...
- capacity scheduler
- classifiers
- CLI
- cluster deployments
- Hadoop configurations, tuning / Getting ready, How to do it...
- clustering
- about / Clustering the text data
- clustering algorithm
- about / Running K-means with Mahout
- collaborative filtering-based recommendations
- about / Collaborative filtering-based recommendations
- implementing / Getting ready, How to do it...
- working / How it works...
- compareTo() method / How it works...
- combiner
- adding, to WordCount MapReduce program / Adding the combiner step to the WordCount MapReduce program, How to do it...
- about / Adding the combiner step to the WordCount MapReduce program
- activating / How it works...
- completebulkload command
- about / How it works...
- complex dataset
- parsing, with Hadoop / Parsing a complex dataset with Hadoop, How to do it..., How it works...
- computational complexity / How it works...
- conf/core-site.xml
- about / How to do it...
- configuration properties / There's more...
- conf/hdfs-site.xml
- about / How to do it...
- configuration properties / There's more...
- conf/mapred-site.xml
- about / How to do it...
- configuration properties / There's more...
- configuration files
- conf/core-site.xml / How to do it...
- conf/hdfs-site.xml / How to do it...
- conf/mapred-site.xml / How to do it...
- configuration properties, conf/core-site.xml
- fs.inmemory.size.mb / There's more...
- io.sort.factor / There's more...
- io.file.buffer.size / There's more...
- configuration properties, conf/hdfs-site.xml
- dfs.block.size / There's more...
- dfs.namenode.handler.count / There's more...
- configuration properties, conf/mapred-site.xml
- mapred.reduce.parallel.copies / There's more...
- mapred.map.child.java.opts / There's more...
- mapred.reduce.child.java.opts / There's more...
- io.sort.mb / There's more...
- content-based recommendations
- about / Content-based recommendations
- implementing / Getting ready, How to do it...
- working / How it works...
- createRecordReader() method
- about / How it works...
- custom Hadoop key type
- implementing / Implementing a custom Hadoop key type, How to do it..., How it works...
- custom Hadoop Writable data type
- custom InputFormat
- custom Partitioner
- implementing / Hadoop intermediate (map to reduce) data partitioning
- Cygwin / Getting ready
D
- data
- emitting, from mapper / Emitting data of different value types from a mapper, How to do it..., How it works...
- grouping, MapReduce used / Performing Group-By using MapReduce, How to do it..., How it works...
- data de-duplication
- Hadoop Streaming, used / Data de-duplication using Hadoop Streaming, How it works...
- HBase, used / Data de-duplication using HBase
- Dataflow language / How to do it...
- data mining algorithm
- about / Installing Mahout
- DataNodes
- about / Introduction
- adding / Adding a new DataNode
- decommissioning / Decommissioning DataNodes, How to do it...
- data preprocessing
- datasets
- joining, MapReduce used / Joining two datasets using MapReduce, Getting ready, How it works...
- debug scripts
- about / Debug scripts – analyzing task failures
- writing / How to do it...
- decommissioning process
- working / Decommissioning DataNodes
- about / How it works...
- DFSIO
- used, for benchmarking / Benchmarking HDFS
- about / Benchmarking HDFS
- distributed cache / How it works...
- distributed mode, Hadoop installation
- about / Introduction
- document classification
- about / Document classification using Mahout Naive Bayes classifier
- Naive Bayes Classifier, used / Document classification using Mahout Naive Bayes classifier, How to do it..., How it works...
E
- EC2 console
- URL / How to do it...
- ElasticSearch
- about / ElasticSearch for indexing and searching
- URL / ElasticSearch for indexing and searching
- used, for indexing and searching data / How to do it, How it works
- download link / How to do it
- working / How it works
- using / How it works
- EMR
- used, for executing Pig script / Executing a Pig script using EMR, How to do it...
- used, for executing Hive script / Executing a Hive script using EMR, How to do it...
- used, for deploying Apache HBase Cluster on Amazon EC2 cloud / Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR, How to do it...
- EMR Bootstrap actions
- used, for configuring VMs for EMR jobs / Using EMR Bootstrap actions to configure VMs for the Amazon EMR jobs, How to do it..., There's more...
- configure-daemons / There's more...
- configure-hadoop / There's more...
- memory-intensive / There's more...
- run-if / There's more...
- EMR CLI
- used, for creating EMR job flow / Creating an Amazon EMR job flow using the Command Line Interface, How to do it...
- EMR job flows
- executing, Amazon EC2 Spot Instances used / Saving money by using Amazon EC2 Spot Instances to execute EMR job flows, How to do it...
- creating, CLI used / Creating an Amazon EMR job flow using the Command Line Interface, How to do it...
- exclude file / How to do it...
F
- failure percentages
- fair scheduler
- fault tolerance
- FIFO scheduler
- file replication factor
- setting / Setting the file replication factor
- FileSystem.create() method / How it works...
- FileSystem.create(filePath) method / How it works...
- FileSystem object
- configuring / Configuring the FileSystem object
- frequency distribution
- about / Calculating frequency distributions and sorting using MapReduce
- calculating, MapReduce used / Calculating frequency distributions and sorting using MapReduce, How it works...
- Fuse-DFS project
- mounting / Mounting HDFS (Fuse-DFS), Getting ready, How to do it...
- working / How it works...
- URL / How it works...
G
- getDistance() method / How it works...
- getFileBlockLocations() function / Retrieving the list of data blocks of a file
- getGeoLocation() method / How it works...
- getInputSplit() method / How it works...
- getLength() method
- about / There's more...
- getLocalCacheFiles() method / How it works...
- getmerge command / How to do it...
- about / How it works...
- getPath() method / How it works...
- getSplits() method
- about / There's more...
- getTypes() method / How to do it...
- getUri() function / Configuring the FileSystem object
- GNU Plot
- used, for plotting results / Plotting the Hadoop results using GNU Plot, How to do it..., How it works...
- URL / There's more...
- Google
- about / Introduction
- Gross National Income (GNI) / Running your first Pig command
H
- Hadoop
- about / Introduction
- setting up / Setting up Hadoop on your machine, How to do it...
- URL / How to do it...
- MapReduce program, writing / Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop, How to do it...
- MapReduce program, executing / Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop
- setting, in distributed cluster environment / Setting Hadoop in a distributed cluster environment, Getting ready, How to do it...
- used, for parsing complex dataset / Parsing a complex dataset with Hadoop, How to do it..., How it works...
- content-based recommendations / Content-based recommendations
- hierarchical clustering / Hierarchical clustering
- Amazon sales dataset clustering / Clustering an Amazon sales dataset
- collaborative filtering-based recommendations / Collaborative filtering-based recommendations
- Adwords balance algorithm / Assigning advertisements to keywords using the Adwords balance algorithm
- Hadoop's Writable-based serialization framework
- Hadoop Aggregate package / How it works...
- Hadoop cluster
- Apache HBase, deploying on / Deploying Apache HBase on a Hadoop cluster, How to do it, How it works...
- deploying on Amazon EC2 cloud, Apache Whirr used / Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment, How to do it..., How it works...
- Hadoop configurations
- tuning / How to do it...
- Hadoop counters
- about / Hadoop counters for reporting custom metrics
- used, for reporting custom metrics / Hadoop counters for reporting custom metrics
- working / How it works...
- Hadoop data types
- Hadoop DistributedCache
- about / Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
- used, for retrieving Map and Reduce tasks / How to do it...
- working / How it works...
- used, for distributing archives / Distributing archives using the DistributedCache
- resources, adding from command line / Adding resources to the DistributedCache from the command line
- used, for adding resources to classpath / Adding resources to the classpath using DistributedCache
- Hadoop GenericWritable data type / How to do it...
- Hadoop InputFormat
- selecting, for input data format / Choosing a suitable Hadoop InputFormat for your input data format
- Hadoop installation
- NameNode / Introduction
- DataNodes / Introduction
- JobTracker / Introduction
- TaskTracker / Introduction
- modes / Introduction
- verifying, benchmarks used / Running benchmarks to verify the Hadoop installation, How it works...
- Hadoop intermediate data partitioning
- Hadoop Kerberos security
- about / Hadoop security – integrating with Kerberos
- pitfalls / How it works...
- Hadoop monitoring UI
- using / Using MapReduce monitoring UI, How to do it...
- working / How it works...
- Hadoop OutputFormats
- used, for formatting MapReduce computations results / Formatting the results of MapReduce computations – using Hadoop OutputFormats, How it works...
- Hadoop Partitioners
- Hadoop results
- plotting, GNU Plot used / Plotting the Hadoop results using GNU Plot, How to do it..., How it works...
- Hadoop scheduler
- hadoop script / How to do it...
- Hadoop security
- about / Hadoop security – integrating with Kerberos
- Kerberos, integrating with / Hadoop security – integrating with Kerberos, How to do it...
- Hadoop Streaming
- about / Using Hadoop with legacy applications – Hadoop Streaming, There's more...
- working / How it works...
- URL / There's more...
- using with Python script-based mapper, for data preprocessing / Data preprocessing (extract, clean, and format conversion) using Hadoop Streaming and Python, How it works..., There's more...
- used, for data de-duplication / Data de-duplication using Hadoop Streaming, How to do it..., How it works...
- Hadoop Tool interface
- HADOOP_LOG_DIR
- about / How it works...
- hashCode() method / How it works..., How it works...
- HashPartitioner partitions
- HBase
- about / Introduction, Installing HBase
- installing / Installing HBase, How to do it...
- downloading / How to do it...
- working / How it works...
- running, in distributed mode / There's more...
- data random access, via Java client APIs / Data random access using Java client APIs, How to do it...
- MapReduce jobs, running / Running MapReduce jobs on HBase (table input/output), How to do it...
- used, for data de-duplication / Data de-duplication using HBase
- HBase cluster
- deploying on Amazon EC2 cloud, Apache Whirr used / Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment, How to do it..., How it works...
- HBase data model
- about / Installing HBase
- reference link / Installing HBase
- HBase TableMapper / How it works
- HDFS
- about / Setting up HDFS, Introduction
- setting up / Setting up HDFS, How to do it...
- working / How it works...
- benchmarking / Benchmarking HDFS, How to do it...
- DataNode, adding / Adding a new DataNode, How to do it...
- rebalancing / Rebalancing HDFS
- files, merging / Merging files in HDFS
- HDFS basic command-line file operations
- executing / HDFS basic command-line file operations, How to do it...
- HDFS block size
- setting / Setting HDFS block size, How to do it...
- HDFS C API
- using / Using HDFS C API (libhdfs), How to do it...
- working / How it works...
- HDFS configuration files
- configuring / Configuring using HDFS configuration files
- hdfsConnectAsUser command / How it works...
- hdfsConnect command / How it works...
- HDFS disk usage
- HDFS filesystem
- mounting / How to do it...
- HDFS Java API
- about / Using HDFS Java API, How to do it..., How it works...
- using / Using HDFS Java API, How to do it...
- working / How it works...
- HDFS monitoring UI
- using / Using HDFS monitoring UI
- hdfsOpenFile command / How it works...
- hdfsRead command / How it works...
- HDFS replication factor
- about / Setting the file replication factor
- working / How it works...
- HDFS setup
- testing / How to do it...
- HDFS web console
- accessing / How to do it...
- hierarchical clustering
- about / Hierarchical clustering
- implementing / Hierarchical clustering, How to do it...
- working / How it works...
- higher-level programming interfaces
- about / Installing Pig
- histograms
- about / Calculating histograms using MapReduce
- calculating, MapReduce used / Calculating histograms using MapReduce, Getting ready, How to do it..., How it works...
- Hive
- about / Introduction, Installing Hive
- downloading / How to do it...
- installing / How to do it...
- working / How it works..., How it works...
- SQL-style query, running with / Running a SQL-style query with Hive, Getting ready, How to do it...
- used, for filtering and sorting / How to do it...
- join, performing with / Performing a join with Hive, How to do it..., How it works...
- Hive interactive session
- Hive script
- executing, EMR used / Executing a Hive script using EMR, How to do it...
- Human Development Report (HDR) / Running a SQL-style query with Hive
- Human Development Report (HDR) data / Running your first Pig command
I
- importtsv and bulkload
- used, for importing large text dataset to HBase / Loading large datasets to an Apache HBase data store using importtsv and bulkload tools, How to do it..., How it works...
- importtsv tool
- about / How it works...
- using / There's more...
- in-links graph
- generating, for crawled web pages / Generating the in-links graph for crawled web pages, How to do it, How it works
- InputFormat implementations
- TextInputFormat / There's more...
- NLineInputFormat / There's more...
- SequenceFileInputFormat / There's more...
- DBInputFormat / There's more...
- InputSplit object
- about / There's more...
- intra-domain web crawling
- Apache Nutch used / Intra-domain web crawling using Apache Nutch, How to do it...
- inverted document frequencies (IDF) / Creating TF and TF-IDF vectors for the text data
- inverted index
- generating, MapReduce used / Generating an inverted index using Hadoop MapReduce, How to do it..., How it works...
J
- Java 1.6
- downloading / Getting ready
- installing / Getting ready
- Java client APIs
- used, for connecting HBase / How to do it...
- Java Cryptography Extension (JCE) Policy / How to do it...
- Java Integrated Development Environment (IDE) / Getting ready
- Java JDK 1.6 / Getting ready
- Java regular expressions
- URL / There's more...
- Java VMs
- reusing, for improving performance / Reusing Java VMs to improve the performance, How it works...
- JDK 1.5
- URL / Building libhdfs
- JobTracker
- about / Introduction
- setting up / How to do it...
- join
- performing, with Hive / Performing a join with Hive, How to do it..., How it works...
- JSON snippet / How to do it...
K
- K-means
- about / How it works..., Running K-means with Mahout
- running, with Mahout / Running K-means with Mahout, How to do it..., How it works...
- K-means results
- visualizing / Visualizing K-means results, How it works...
- Kerberos
- integrating with / Hadoop security – integrating with Kerberos
- installing / How to do it...
- principals / How to do it...
- Kerberos setup
- about / Hadoop security – integrating with Kerberos
- NameNode / Hadoop security – integrating with Kerberos
- DataNodes / Hadoop security – integrating with Kerberos
- JobTracker / Hadoop security – integrating with Kerberos
- TaskTrackers / Hadoop security – integrating with Kerberos
- KeyFieldPartitioner / KeyFieldBasedPartitioner
- KeyValueTextInputFormat
- about / How it works...
- kinit command / How it works...
L
- large text dataset
- importing to HBase, importtsv and bulkload used / Loading large datasets to an Apache HBase data store using importtsv and bulkload tools, Getting ready, How to do it..., How it works..., There's more...
- LDA
- about / Topic discovery using Latent Dirichlet Allocation (LDA)
- used, for topic discovery / Topic discovery using Latent Dirichlet Allocation (LDA), How to do it...
- libhdfs
- about / Using HDFS C API (libhdfs)
- using / Getting ready
- building / Building libhdfs
- Libtool package
- about / Building libhdfs
- local mode, Hadoop installation
- about / Introduction
- working / How it works...
- LogFileInputFormat
- about / How it works...
- LogFileRecordReader class
- about / How it works...
- LogWritable class
- about / How it works...
M
- machine learning algorithm
- about / Installing Mahout
- Mahout
- about / Introduction, Installing Mahout
- installing / How to do it...
- working / How it works...
- K-means, running with / Running K-means with Mahout, How to do it..., How it works...
- Mahout installation
- verifying / How to do it...
- Mahout K-Means algorithm / How it works...
- Mahout seqdumper command / How it works...
- Mahout split command
- about / How it works...
- map() function / How it works...
- MapFile
- about / There's more...
- mapper
- data, emitting from / Emitting data of different value types from a mapper, How to do it...
- implementing, for HTTP log processing application / Using Hadoop with legacy applications – Hadoop Streaming, How to do it...
- MapReduce
- about / Introduction
- used, for calculating simple analytics / Simple analytics using MapReduce, Getting ready, How to do it..., How it works...
- used, for grouping data / Performing Group-By using MapReduce, How to do it..., How it works...
- used, for calculating frequency distributions / Calculating frequency distributions and sorting using MapReduce, How it works...
- used, for calculating histograms / Calculating histograms using MapReduce, Getting ready, How to do it..., How it works...
- used, for calculating Scatter plots / Calculating scatter plots using MapReduce, Getting ready, How to do it..., How it works...
- used, for joining datasets / Joining two datasets using MapReduce, How to do it..., How it works...
- used, for generating inverted index / Generating an inverted index using Hadoop MapReduce, How to do it..., How it works...
- MapReduce application
- MultipleInputs feature, using / Using multiple input data types and multiple mapper implementations in a single MapReduce application
- MapReduce computations
- running, Amazon Elastic MapReduce (EMR) used / Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR), How to do it...
- MapReduce computations results
- formatting, Hadoop OutputFormats used / Formatting the results of MapReduce computations – using Hadoop OutputFormats, How it works...
- MapReduce jobs
- dependencies, adding / Adding dependencies between MapReduce jobs, How to do it...
- running, on HBase / Running MapReduce jobs on HBase (table input/output), How to do it...
- working / How it works...
- MapReduce monitoring UI
- using / Using MapReduce monitoring UI, How to do it...
- working / How it works...
- MBOX format / Joining two datasets using MapReduce
- minSupport / How it works...
- modes, Hadoop installation
- local mode / Introduction
- Pseudo distributed mode / Introduction
- distributed mode / Introduction
- mrbench / There's more...
- multi-dimensional space / Clustering an Amazon sales dataset
- multiple disks/volumes
- MultipleInputs feature
- using, in MapReduce application / Using multiple input data types and multiple mapper implementations in a single MapReduce application
N
- 20news dataset
- downloading / How to do it...
- Naive Bayes Classifier
- about / Classification using Naive Bayes Classifier
- URL / Classification using Naive Bayes Classifier
- implementing / Classification using Naive Bayes Classifier, How to do it...
- working / How it works...
- used, for document classification / Document classification using Mahout Naive Bayes classifier, How to do it..., How it works...
- NameNode
- about / Introduction
- NASA weblog dataset
- URL / Introduction
- nextKeyValue() method
- about / How it works..., How it works...
- NLineInputFormat
- about / There's more...
- nnbench / There's more...
- non-Euclidean space / Clustering an Amazon sales dataset
O
- orthogonal axes / Clustering an Amazon sales dataset
P
- <path> parameter / How it works...
- Partitioner / How it works...
- Pattern.compile() method / How it works...
- Pig
- about / Introduction, Installing Pig
- installing / How to do it...
- downloading / How to do it...
- join and sort operations, implementing / Set operations (join, union) and sorting with Pig, How to do it..., There's more...
- Pig command
- running / Running your first Pig command, How to do it...
- working / How it works...
- Pig interactive session
- Pig script
- executing, EMR used / Executing a Pig script using EMR, How to do it...
- primitive data types
- IntWritable / There's more...
- LongWritable / There's more...
- BooleanWritable / There's more...
- FloatWritable / There's more...
- ByteWritable / There's more...
- principals
- Pseudo distributed mode, Hadoop installation
- about / Introduction
R
- random sample / Clustering an Amazon sales dataset
- readFields() method / How it works...
- read performance benchmark
- running / How to do it...
- rebalancer tool
- about / Rebalancing HDFS
- reduce() function / How it works...
- reduce() method / How to do it...
S
- S3 bucket / How to do it...
- Scatter plot
- about / Calculating scatter plots using MapReduce
- calculating, MapReduce used / Calculating scatter plots using MapReduce, Getting ready, How to do it..., How it works...
- scheduling
- seq2sparse command / How it works...
- seqdirectory command / How it works...
- SequenceFileInputFormat
- about / There's more...
- SequenceFileAsBinaryInputFormat / There's more...
- SequenceFileAsTextInputFormat / There's more...
- setrep command syntax / How it works...
- shared-user Hadoop clusters
- simple analytics
- calculating, MapReduce used / Simple analytics using MapReduce, How to do it..., How it works...
- speculative execution
- SQL-style query
- running, with Hive / Running a SQL-style query with Hive, How to do it...
- SSH server / Getting ready
T
- -threshold parameter
- about / Rebalancing HDFS
- tab-separated value (TSV) file / How to do it...
- TableMapReduceUtil class / How it works
- task failures
- TaskTracker
- about / Introduction
- TaskTrackers
- setting up / How to do it...
- TeraSort / There's more...
- term frequencies (TF) / Creating TF and TF-IDF vectors for the text data
- Term frequency-inverse document frequency (TF-IDF) model / Creating TF and TF-IDF vectors for the text data
- TestDFSIO / There's more...
- testmapredsort job / How it works...
- text data
- clustering / Clustering the text data, How to do it..., How it works...
- TextInputFormat
- about / There's more...
- TextInputFormat class / How it works...
- TF and TF-IDF vectors
- creating, for text data / Creating TF and TF-IDF vectors for the text data, Getting ready, How to do it...
- working / How it works...
- Topic discovery
- toString() method / There's more...
- TotalOrderPartitioner / There's more...
- Twahpic / Topic discovery using Latent Dirichlet Allocation (LDA)
V
- VMs
- configuring for EMR jobs, EMR Bootstrap actions used / Using EMR Bootstrap actions to configure VMs for the Amazon EMR jobs, How to do it...
W
- web crawling
- about / Intra-domain web crawling using Apache Nutch
- performing, Apache Nutch used with Hadoop/HBase cluster / Whole web crawling with Apache Nutch using a Hadoop/HBase cluster, How to do it, How it works
- web documents
- indexing and searching, Apache Solr used / Indexing and searching web documents using Apache Solr, How to do it
- WordCount MapReduce program
- writing / Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop, How to do it...
- working / How it works...
- combiner step, adding / Adding the combiner step to the WordCount MapReduce program, How to do it...
- running, in distributed cluster environment / Running the WordCount program in a distributed cluster environment, How to do it..., How it works...
- Writable interface / Choosing appropriate Hadoop data types
- write() method / How it works...
- write performance benchmark
- running / How to do it...
Z
- Zipf / How to do it...
- zlib-devel package
- about / Building libhdfs