Datasets often contain duplicate items that need to be eliminated to ensure the accuracy of the results. In this recipe, we use Hadoop to remove duplicate mail records from the 20news dataset. These duplicates arise when users cross-post the same message to multiple newsgroups.
The following steps show how to remove from the 20news dataset the duplicate mails that result from cross-posting across newsgroups:
Download and extract the 20news dataset from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz:
$ wget http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz
$ tar -xzf 20news-19997.tar.gz
Upload the extracted data to HDFS. To save compute time and resources, you can use only a subset of the dataset:
$ hdfs dfs -mkdir 20news-all
$ hdfs dfs -put <extracted_folder> 20news-all
We are going to use the MailPreProcessor.py Python...
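The MailPreProcessor.py script itself is not reproduced here, but the core deduplication idea can be sketched independently. Cross-posted copies of the same mail carry the same Message-ID header, so that header can serve as the key for eliminating duplicates. The following is a minimal, hedged sketch of that idea in plain Python (the header-based keying and the keep-first policy are assumptions for illustration; the actual recipe script may differ):

```python
def message_id(mail_text):
    """Extract the Message-ID header from a raw mail.

    Cross-posted copies of the same mail share a Message-ID even
    though they appear under different newsgroup folders.
    """
    for line in mail_text.splitlines():
        if line.lower().startswith("message-id:"):
            return line.split(":", 1)[1].strip()
    return None  # malformed mail without a Message-ID header


def deduplicate(mails):
    """Keep the first mail seen for each Message-ID.

    This mirrors what a reducer would do after grouping mails by
    Message-ID: emit one representative per key. (Assumption: the
    real MailPreProcessor.py may use different selection logic.)
    """
    seen = set()
    unique = []
    for mail in mails:
        mid = message_id(mail)
        if mid is None or mid not in seen:
            if mid is not None:
                seen.add(mid)
            unique.append(mail)
    return unique


if __name__ == "__main__":
    # Two cross-posted copies of the same mail plus one distinct mail.
    a = "Newsgroups: comp.graphics\nMessage-ID: <1@x>\n\nhello"
    b = "Newsgroups: sci.space\nMessage-ID: <1@x>\n\nhello"
    c = "Newsgroups: sci.space\nMessage-ID: <2@x>\n\nother"
    print(len(deduplicate([a, b, c])))  # the duplicate copy is dropped
```

In a Hadoop Streaming job, `message_id` would run in the mapper to emit `(Message-ID, mail)` pairs, and the keep-one-per-key step of `deduplicate` would run in the reducer, where Hadoop has already grouped the pairs by key.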