Hadoop Real-World Solutions Cookbook

By: Jonathan R. Owens, Jon Lentz, Brian Femiano

Overview of this book

This book helps developers become more comfortable and proficient at solving problems in the Hadoop space, and familiarizes them with a wide variety of Hadoop-related tools and best practices for implementation.

Hadoop Real-World Solutions Cookbook teaches readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.

Hadoop Real-World Solutions Cookbook provides in-depth explanations and code examples. Each chapter contains a set of recipes that pose, then solve, technical challenges, and can be completed in any order. A recipe breaks a single problem down into discrete steps that are easy to follow. The book covers loading and unloading data to and from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine learning approaches with Mahout, debugging and troubleshooting MapReduce, and columnar storage and retrieval of structured data using Apache Accumulo.

Hadoop Real-World Solutions Cookbook gives readers the examples they need to apply Hadoop technology to their own problems.

Enabling MapReduce jobs to skip bad records


When working with the amounts of data that Hadoop was designed to process, it is only a matter of time before even the most robust job runs into unexpected or malformed data. If not handled properly, bad data can easily cause a job to fail. By default, Hadoop will not skip bad data. For some applications, it may be acceptable to skip a small percentage of the input data. Hadoop provides a way to do just that. Even if skipping data is not acceptable for a given use case, Hadoop's skipping mechanism can be used to pinpoint the bad data and log it for review.

How to do it...

  1. To enable skipping of up to 100 bad records in the map phase of a job, add the following to the run() method where the job configuration is set up:

    SkipBadRecords.setMapperMaxSkipRecords(conf, 100);
  2. To enable skipping of up to 100 bad record groups in the reduce phase, add the following to the run() method where the job configuration is set up (a complete run() sketch follows these steps):

    SkipBadRecords.setReducerMaxSkipGroups(conf, 100);
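
  For context, here is a minimal sketch of a driver whose run() method applies both settings. The SkipDemo class name, job name, and argument handling are illustrative assumptions, not code from this recipe; the sketch uses the older org.apache.hadoop.mapred API (where the SkipBadRecords class lives) and relies on the default identity mapper and reducer so that it stays self-contained:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Hypothetical driver class, used only to show where the skip settings belong.
    public class SkipDemo extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            JobConf conf = new JobConf(getConf(), SkipDemo.class);
            conf.setJobName("skip-bad-records-demo");

            // The old-API defaults (identity mapper/reducer, TextInputFormat)
            // are used here; substitute your own mapper and reducer classes.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Skip up to 100 bad records around a failure in the map phase,
            // and up to 100 bad key groups in the reduce phase.
            SkipBadRecords.setMapperMaxSkipRecords(conf, 100);
            SkipBadRecords.setReducerMaxSkipGroups(conf, 100);

            JobClient.runJob(conf);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new SkipDemo(), args));
        }
    }

  Note that skipping mode only turns on after a task attempt has already failed a configurable number of times, and that skipped records are written back to HDFS so they can be reviewed later.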

How it works...