Hadoop Real-World Solutions Cookbook

By: Jonathan R. Owens, Jon Lentz, Brian Femiano

Overview of this book

Hadoop Real-World Solutions Cookbook helps developers become more comfortable and proficient with solving problems in the Hadoop space. Readers will become more familiar with a wide variety of Hadoop-related tools and best practices for implementation. The book teaches readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia, and provides in-depth explanations and code examples. Each chapter contains a set of recipes that pose, then solve, technical challenges, and the recipes can be completed in any order. A recipe breaks a single problem down into discrete steps that are easy to follow. The book covers loading data into and unloading it from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine learning approaches with Mahout, debugging and troubleshooting MapReduce, and columnar storage and retrieval of structured data using Apache Accumulo. It gives readers the examples they need to apply Hadoop technology to their own problems.

Credits

About the Authors

About the Reviewers

www.packtpub.com

Preface

Hadoop Distributed File System – Importing and Exporting Data

Introduction

Importing and exporting data into HDFS using Hadoop shell commands

Moving data efficiently between clusters using Distributed Copy

Importing data from MySQL into HDFS using Sqoop

Exporting data from HDFS into MySQL using Sqoop

Configuring Sqoop for Microsoft SQL Server

Exporting data from HDFS into MongoDB

Importing data from MongoDB into HDFS

Exporting data from HDFS into MongoDB using Pig

Using HDFS in a Greenplum external table

Using Flume to load data into HDFS

HDFS

Introduction

Reading and writing data to HDFS

Compressing data using LZO

Reading and writing data to SequenceFiles

Using Apache Avro to serialize data

Using Apache Thrift to serialize data

Using Protocol Buffers to serialize data

Setting the replication factor for HDFS

Setting the block size for HDFS

Extracting and Transforming Data

Introduction

Transforming Apache logs into TSV format using MapReduce

Using Apache Pig to filter bot traffic from web server logs

Using Apache Pig to sort web server log data by timestamp

Using Apache Pig to sessionize web server log data

Using Python to extend Apache Pig functionality

Using MapReduce and secondary sort to calculate page views

Using Hive and Python to clean and transform geographical event data

Using Python and Hadoop Streaming to perform a time series analytic

Using MultipleOutputs in MapReduce to name output files

Creating custom Hadoop Writable and InputFormat to read geographical event data

Performing Common Tasks Using Hive, Pig, and MapReduce

Introduction

Using Hive to map an external table over weblog data in HDFS

Using Hive to dynamically create tables from the results of a weblog query

Using the Hive string UDFs to concatenate fields in weblog data

Using Hive to intersect weblog IPs and determine the country

Generating n-grams over news archives using MapReduce

Using the distributed cache in MapReduce to find lines that contain matching keywords over news archives

Using Pig to load a table and perform a SELECT operation with GROUP BY

Advanced Joins

Introduction

Joining data in the Mapper using MapReduce

Joining data using Apache Pig replicated join

Joining sorted data using Apache Pig merge join

Joining skewed data using Apache Pig skewed join

Using a map-side join in Apache Hive to analyze geographical events

Using optimized full outer joins in Apache Hive to analyze geographical events

Joining data using an external key-value store (Redis)

Big Data Analysis

Introduction

Counting distinct IPs in weblog data using MapReduce and Combiners

Using Hive date UDFs to transform and sort event dates from geographic event data

Using Hive to build a per-month report of fatalities over geographic event data

Implementing a custom UDF in Hive to help validate source reliability over geographic event data

Marking the longest period of non-violence using Hive MAP/REDUCE operators and Python

Calculating the cosine similarity of artists in the Audioscrobbler dataset using Pig

Trim Outliers from the Audioscrobbler dataset using Pig and datafu

Advanced Big Data Analysis

Introduction

PageRank with Apache Giraph

Single-source shortest-path with Apache Giraph

Using Apache Giraph to perform a distributed breadth-first search

Collaborative filtering with Apache Mahout

Clustering with Apache Mahout

Sentiment classification with Apache Mahout

Debugging

Introduction

Using Counters in a MapReduce job to track bad records

Developing and testing MapReduce jobs with MRUnit

Developing and testing MapReduce jobs running in local mode

Enabling MapReduce jobs to skip bad records

Using Counters in a streaming job

Updating task status messages to display debugging information

Using illustrate to debug Pig jobs

System Administration

Introduction

Starting Hadoop in pseudo-distributed mode

Starting Hadoop in distributed mode

Adding new nodes to an existing cluster

Safely decommissioning nodes

Recovering from a NameNode failure

Monitoring cluster health using Ganglia

Tuning MapReduce job parameters

Persistence Using Apache Accumulo

Introduction

Designing a row key to store geographic events in Accumulo

Using MapReduce to bulk import geographic event data into Accumulo

Setting a custom field constraint for inputting geographic event data in Accumulo

Limiting query results using the regex filtering iterator

Counting fatalities for different versions of the same key using SumCombiner

Enforcing cell-level security on scans using Accumulo

Aggregating sources in Accumulo using MapReduce

Index

Preface

Hadoop Real-World Solutions Cookbook helps developers become more comfortable with, and proficient at solving problems in, the Hadoop space. Readers will become more familiar with a wide variety of Hadoop-related tools and best practices for implementation.

This book will teach readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.

This book provides in-depth explanations and code examples. Each chapter contains a set of recipes that pose, and then solve, technical challenges, and the recipes can be completed in any order. A recipe breaks a single problem down into discrete steps that are easy to follow. This book covers loading data into and unloading it from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine-learning approaches with Mahout, debugging and troubleshooting MapReduce jobs, and columnar storage and retrieval of structured data using Apache Accumulo.

This book will give readers the examples they need to apply Hadoop technology to their own problems.

What this book covers

Chapter 1, Hadoop Distributed File System – Importing and Exporting Data, shows several approaches for loading and unloading data from several popular databases that include MySQL, MongoDB, Greenplum, and MS SQL Server, among others, with the aid of tools such as Pig, Flume, and Sqoop.

Chapter 2, HDFS, includes recipes for reading and writing data to/from HDFS. It shows how to use different serialization libraries, including Avro, Thrift, and Protocol Buffers. Also covered is how to set the block size and replication, and enable LZO compression.

Chapter 3, Extracting and Transforming Data, includes recipes that show basic Hadoop ETL over several different types of data sources. Different tools, including Hive, Pig, and the Java MapReduce API, are used to batch-process data samples and produce one or more transformed outputs.

Chapter 4, Performing Common Tasks Using Hive, Pig, and MapReduce, focuses on how to leverage certain functionality in these tools to quickly tackle many different classes of problems. This includes string concatenation, external table mapping, simple table joins, custom functions, and dependency distribution across the cluster.

Chapter 5, Advanced Joins, contains recipes that demonstrate more complex and useful join techniques in MapReduce, Hive, and Pig. These recipes show merged, replicated, and skewed joins in Pig as well as Hive map-side and full outer joins. There is also a recipe that shows how to use Redis to join data from an external data store.

Chapter 6, Big Data Analysis, contains recipes designed to show how you can put Hadoop to use to answer different questions about your data. Several of the Hive examples will demonstrate how to properly implement and use a custom function (UDF) for reuse in different analytics. There are two Pig recipes that show different analytics with the Audioscrobbler dataset and one MapReduce Java API recipe that shows Combiners.

Chapter 7, Advanced Big Data Analysis, shows recipes in Apache Giraph and Mahout that tackle different types of graph analytics and machine-learning challenges.

Chapter 8, Debugging, includes recipes designed to aid in the troubleshooting and testing of MapReduce jobs. There are examples that use MRUnit and local mode for ease of testing. There are also recipes that emphasize the importance of using counters and updating task status to help monitor the MapReduce job.

Chapter 9, System Administration, focuses mainly on how to performance-tune and optimize the different settings available in Hadoop. Several different topics are covered, including basic setup, XML configuration tuning, troubleshooting bad data nodes, handling NameNode failure, and performance monitoring using Ganglia.

Chapter 10, Persistence Using Apache Accumulo, contains recipes that show off many of the unique features and capabilities of the NoSQL datastore Apache Accumulo, including iterators, combiners, scan authorizations, and constraints. There are also examples for building an efficient geospatial row key and performing batch analysis using MapReduce.

What you need for this book

Readers will need access to a pseudo-distributed (single machine) or fully-distributed (multi-machine) cluster to execute the code in this book. The various tools that the recipes leverage need to be installed and properly configured on the cluster. Moreover, the code recipes throughout this book are written in different languages; therefore, it’s best if readers have access to a machine with development tools they are comfortable using.

Who this book is for

This book uses concise code examples to highlight different types of real-world problems you can solve with Hadoop. It is designed for developers with varying levels of comfort using Hadoop and related tools. Hadoop beginners can use the recipes to shorten the learning curve and see real-world examples of Hadoop in use. More experienced Hadoop developers may find that the tools and techniques expose them to new ways of thinking, or help clarify a framework they had heard of but whose value they had not fully understood.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: “All of the Hadoop filesystem shell commands take the general form hadoop fs -COMMAND.”

A block of code is set as follows:

weblogs = load '/data/weblogs/weblog_entries.txt' as
                (md5:chararray,
                  url:chararray,
                  date:chararray,
                  time:chararray,
                  ip:chararray);

md5_grp = group weblogs by md5 parallel 4;

store md5_grp into '/data/weblogs/weblogs_md5_groups.bcp';

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

weblogs = load '/data/weblogs/weblog_entries.txt' as
                (md5:chararray,
                  url:chararray,
                  date:chararray,
                  time:chararray,
                  ip:chararray);

md5_grp = group weblogs by md5 parallel 4;

store md5_grp into '/data/weblogs/weblogs_md5_groups.bcp';

Any command-line input or output is written as follows:

hadoop distcp -m 10 hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: “To build the JAR file, download the Jython Java installer, run the installer, and select Standalone from the installation menu.”

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title in the subject line of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
