Hadoop MapReduce Cookbook

By: Srinath Perera, Thilina Gunarathne

Overview of this book

We are facing an avalanche of data. The unstructured data we gather can contain many insights that might hold the key to business success or failure. The ability to analyze and process this data with Hadoop MapReduce is one of the most highly sought-after skills in today's job market. "Hadoop MapReduce Cookbook" is a one-stop guide to processing large and complex data sets using the Hadoop ecosystem. The book introduces you to simple examples and then dives deep to solve in-depth big data use cases. It presents more than 50 ready-to-use Hadoop MapReduce recipes in a simple and straightforward manner, with step-by-step instructions and real-world examples. Start with how to install, then configure, extend, and administer Hadoop. Then write simple examples, learn MapReduce patterns, harness the Hadoop landscape, and finally jump to the cloud. The book covers topics such as setting up Hadoop security and using MapReduce to solve analytics, classification, online marketing, recommendation, and search use cases. You will learn how to harness components from the Hadoop ecosystem including HBase, Hive, Pig, and Mahout, then learn how to set up cloud environments to perform Hadoop MapReduce computations. "Hadoop MapReduce Cookbook" teaches you how to process large and complex data sets using real examples, providing a comprehensive guide to getting things done with Hadoop MapReduce.

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Getting Hadoop Up and Running in a Cluster

Introduction

Setting up Hadoop on your machine

Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop

Adding the combiner step to the WordCount MapReduce program

Setting up HDFS

Using HDFS monitoring UI

HDFS basic command-line file operations

Setting Hadoop in a distributed cluster environment

Running the WordCount program in a distributed cluster environment

Using MapReduce monitoring UI

Advanced HDFS

Introduction

Benchmarking HDFS

Adding a new DataNode

Decommissioning DataNodes

Using multiple disks/volumes and limiting HDFS disk usage

Setting HDFS block size

Setting the file replication factor

Using HDFS Java API

Using HDFS C API (libhdfs)

Mounting HDFS (Fuse-DFS)

Merging files in HDFS

Advanced Hadoop MapReduce Administration

Introduction

Tuning Hadoop configurations for cluster deployments

Running benchmarks to verify the Hadoop installation

Reusing Java VMs to improve the performance

Fault tolerance and speculative execution

Debug scripts – analyzing task failures

Setting failure percentages and skipping bad records

Shared-user Hadoop clusters – using fair and other schedulers

Hadoop security – integrating with Kerberos

Using the Hadoop Tool interface

Developing Complex Hadoop MapReduce Applications

Introduction

Choosing appropriate Hadoop data types

Implementing a custom Hadoop Writable data type

Implementing a custom Hadoop key type

Emitting data of different value types from a mapper

Choosing a suitable Hadoop InputFormat for your input data format

Adding support for new input data formats – implementing a custom InputFormat

Formatting the results of MapReduce computations – using Hadoop OutputFormats

Hadoop intermediate (map to reduce) data partitioning

Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache

Using Hadoop with legacy applications – Hadoop Streaming

Adding dependencies between MapReduce jobs

Hadoop counters for reporting custom metrics

Hadoop Ecosystem

Introduction

Installing HBase

Data random access using Java client APIs

Running MapReduce jobs on HBase (table input/output)

Installing Pig

Running your first Pig command

Set operations (join, union) and sorting with Pig

Installing Hive

Running a SQL-style query with Hive

Performing a join with Hive

Installing Mahout

Running K-means with Mahout

Visualizing K-means results

Analytics

Introduction

Simple analytics using MapReduce

Performing Group-By using MapReduce

Calculating frequency distributions and sorting using MapReduce

Plotting the Hadoop results using GNU Plot

Calculating histograms using MapReduce

Calculating scatter plots using MapReduce

Parsing a complex dataset with Hadoop

Joining two datasets using MapReduce

Searching and Indexing

Introduction

Generating an inverted index using Hadoop MapReduce

Intra-domain web crawling using Apache Nutch

Indexing and searching web documents using Apache Solr

Configuring Apache HBase as the backend data store for Apache Nutch

Deploying Apache HBase on a Hadoop cluster

Whole web crawling with Apache Nutch using a Hadoop/HBase cluster

ElasticSearch for indexing and searching

Generating the in-links graph for crawled web pages

Classifications, Recommendations, and Finding Relationships

Introduction

Content-based recommendations

Hierarchical clustering

Clustering an Amazon sales dataset

Collaborative filtering-based recommendations

Classification using Naive Bayes Classifier

Assigning advertisements to keywords using the Adwords balance algorithm

Mass Text Data Processing

Introduction

Data preprocessing (extract, clean, and format conversion) using Hadoop Streaming and Python

Data de-duplication using Hadoop Streaming

Loading large datasets to an Apache HBase data store using importtsv and bulkload tools

Creating TF and TF-IDF vectors for the text data

Clustering the text data

Topic discovery using Latent Dirichlet Allocation (LDA)

Document classification using Mahout Naive Bayes classifier

Cloud Deployments: Using Hadoop on Clouds

Introduction

Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)

Saving money by using Amazon EC2 Spot Instances to execute EMR job flows

Executing a Pig script using EMR

Executing a Hive script using EMR

Creating an Amazon EMR job flow using the Command Line Interface

Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR

Using EMR Bootstrap actions to configure VMs for the Amazon EMR jobs

Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment

Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment

Index

Creating TF and TF-IDF vectors for the text data

Most text analysis and data mining algorithms operate on vector data. We can use a vector space model to represent text data as a set of vectors. For example, we can build a vector space model by taking the set of all terms that appear in the dataset and assigning an index to each term in the term set. The number of terms in the term set is the dimensionality of the resulting vectors, and each dimension of a vector corresponds to a term. For each document, the vector contains the number of occurrences of each term at the index location assigned to that particular term. This creates a vector space model using the term frequencies in each document, similar to the result of the computation we perform in the Generating an inverted index using Hadoop MapReduce recipe of Chapter 7, Searching and Indexing.

The term frequencies and the resulting document vectors
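The term count model described above can be sketched in a few lines of plain Python. This is only an illustrative, single-process sketch and not the recipe's implementation: the function names `build_vocabulary` and `term_frequency_vector`, the whitespace tokenization, the lowercasing, and the three sample documents are all assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign an index to every distinct term that appears in the dataset."""
    # Assumption: terms are lowercased, whitespace-separated tokens.
    terms = sorted({term for doc in documents for term in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}


def term_frequency_vector(document, vocabulary):
    """Build one vector per document: the count of each term at its index."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)  # dimensionality = number of terms
    for term, count in counts.items():
        if term in vocabulary:
            vector[vocabulary[term]] = count
    return vector


# Hypothetical sample documents, used only to illustrate the model.
documents = [
    "hadoop stores data",
    "hadoop processes data with mapreduce",
    "mapreduce mapreduce jobs",
]

vocabulary = build_vocabulary(documents)
vectors = [term_frequency_vector(doc, vocabulary) for doc in documents]

for doc, vector in zip(documents, vectors):
    print(vector, "<-", doc)
```

Each vector has one dimension per term in the term set, so all three vectors here have the same length even though the documents differ in length. On a real dataset the vocabulary is typically large and the vectors mostly zeros, which is why sparse representations are usually preferred over the dense lists used in this sketch.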

However, creating vectors using the preceding term count model gives a lot of weight to...
