Hadoop Beginner's Guide

Overview of this book

Data is arriving faster than you can process it, and the overall volumes keep growing at a rate that keeps you awake at night. Hadoop can help you tame the data beast. Effective use of Hadoop, however, requires a mixture of programming, design, and system administration skills.

"Hadoop Beginner's Guide" removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to effectively use Hadoop to solve real-world problems.

Starting with the basics of installing and configuring Hadoop, the book explains how to develop applications, maintain the system, and use additional products to integrate with other systems.

Alongside the different ways to develop applications to run on Hadoop, the book also covers tools such as Hive, Sqoop, and Flume that show how Hadoop can be integrated with relational databases and log collection.

In addition to examples on Hadoop clusters running on Ubuntu, uses of cloud services such as Amazon EC2 and Elastic MapReduce are covered.

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

What It's All About

Big data processing

Cloud computing with Amazon Web Services

Summary

Getting Hadoop Up and Running

Hadoop on a local Ubuntu host

Time for action – checking the prerequisites

Time for action – downloading Hadoop

Time for action – setting up SSH

Time for action – using Hadoop to calculate Pi

Time for action – configuring the pseudo-distributed mode

Time for action – changing the base HDFS directory

Time for action – formatting the NameNode

Time for action – starting Hadoop

Time for action – using HDFS

Time for action – WordCount, the Hello World of MapReduce

Using Elastic MapReduce

Time for action – WordCount on EMR using the management console

Comparison of local versus EMR Hadoop

Summary

Understanding MapReduce

Key/value pairs

The Hadoop Java API for MapReduce

Writing MapReduce programs

Time for action – setting up the classpath

Time for action – implementing WordCount

Time for action – building a JAR file

Time for action – running WordCount on a local Hadoop cluster

Time for action – running WordCount on EMR

Time for action – WordCount the easy way

Walking through a run of WordCount

Time for action – WordCount with a combiner

Time for action – fixing WordCount to work with a combiner

Hadoop-specific data types

Time for action – using the Writable wrapper classes

Input/output

Summary

Developing MapReduce Programs

Using languages other than Java with Hadoop

Time for action – implementing WordCount using Streaming

Analyzing a large dataset

Time for action – summarizing the UFO data

Time for action – summarizing the shape data

Time for action – correlating sighting duration to UFO shape

Time for action – performing the shape/time analysis from the command line

Time for action – using ChainMapper for field validation/analysis

Time for action – using the Distributed Cache to improve location output

Counters, status, and other output

Time for action – creating counters, task states, and writing log output

Summary

Advanced MapReduce Techniques

Simple, advanced, and in-between

Joins

Time for action – reduce-side join using MultipleInputs

Graph algorithms

Time for action – representing the graph

Time for action – creating the source code

Time for action – the first run

Time for action – the second run

Time for action – the third run

Time for action – the fourth and last run

Using language-independent data structures

Time for action – getting and installing Avro

Time for action – defining the schema

Time for action – creating the source Avro data with Ruby

Time for action – consuming the Avro data with Java

Time for action – generating shape summaries in MapReduce

Time for action – examining the output data with Ruby

Time for action – examining the output data with Java

Summary

When Things Break

Failure

Time for action – killing a DataNode process

Time for action – the replication factor in action

Time for action – intentionally causing missing blocks

Time for action – killing a TaskTracker process

Time for action – killing the JobTracker

Time for action – killing the NameNode process

Time for action – causing task failure

Time for action – handling dirty data by using skip mode

Summary

Keeping Things Running

A note on EMR

Hadoop configuration properties

Time for action – browsing default properties

Setting up a cluster

Time for action – examining the default rack configuration

Time for action – adding a rack awareness script

Cluster access control

Time for action – demonstrating the default security

Managing the NameNode

Time for action – adding an additional fsimage location

Time for action – swapping to a new NameNode host

Managing HDFS

MapReduce management

Time for action – changing job priorities and killing a job

Scaling

Summary

A Relational View on Data with Hive

Overview of Hive

Setting up Hive

Time for action – installing Hive

Using Hive

Time for action – creating a table for the UFO data

Time for action – inserting the UFO data

Time for action – validating the table

Time for action – redefining the table with the correct column separator

Time for action – creating a table from an existing file

Time for action – performing a join

Time for action – using views

Time for action – exporting query output

Time for action – making a partitioned UFO sighting table

Time for action – adding a new User Defined Function (UDF)

Hive on Amazon Web Services

Time for action – running UFO analysis on EMR

Summary

Working with Relational Databases

Common data paths

Setting up MySQL

Time for action – installing and setting up MySQL

Time for action – configuring MySQL to allow remote connections

Time for action – setting up the employee database

Getting data into Hadoop

Time for action – downloading and configuring Sqoop

Time for action – exporting data from MySQL to HDFS

Time for action – exporting data from MySQL into Hive

Time for action – a more selective import

Time for action – using a type mapping

Time for action – importing data from a raw query

Getting data out of Hadoop

Time for action – importing data from Hadoop into MySQL

Time for action – importing Hive data into MySQL

Time for action – fixing the mapping and re-running the export

AWS considerations

Summary

Data Collection with Flume

A note about AWS

Data data everywhere...

Time for action – getting web server data into Hadoop

Introducing Apache Flume

Time for action – installing and configuring Flume

Time for action – capturing network traffic in a log file

Time for action – logging to the console

Time for action – capturing the output of a command to a flat file

Time for action – capturing a remote file in a local flat file

Time for action – writing network traffic onto HDFS

Time for action – adding timestamps

Time for action – multi level Flume networks

Time for action – writing to multiple sinks

The bigger picture

Summary

Where to Go Next

What we did and didn't cover in this book

Upcoming Hadoop changes

Alternative distributions

Other Apache projects

Other programming abstractions

AWS resources

Sources of information

Summary

Pop Quiz Answers

Chapter 3, Understanding MapReduce

Chapter 7, Keeping Things Running

Index

Preface

This book is here to help you make sense of Hadoop and use it to solve your big data problems. It's a really exciting time to work with data processing technologies such as Hadoop. The ability to apply complex analytics to large data sets—once the monopoly of large corporations and government agencies—is now possible through free open source software (OSS).

But because of the seeming complexity and pace of change in this area, getting a grip on the basics can be somewhat intimidating. That's where this book comes in, giving you an understanding of just what Hadoop is, how it works, and how you can use it to extract value from your data now.

In addition to an explanation of core Hadoop, we also spend several chapters exploring other technologies that either use Hadoop or integrate with it. Our goal is to give you an understanding not just of what Hadoop is but also how to use it as a part of your broader technical infrastructure.

A complementary technology is the use of cloud computing, and in particular, the offerings from Amazon Web Services. Throughout the book, we will show you how to use these services to host your Hadoop workloads, demonstrating that not only can you process large data volumes, but also you don't actually need to buy any physical hardware to do so.

What this book covers

This book comprises three main parts: chapters 1 through 5, which cover the core of Hadoop and how it works; chapters 6 and 7, which cover the more operational aspects of Hadoop; and chapters 8 through 11, which look at the use of Hadoop alongside other products and technologies.

Chapter 1, What It's All About, gives an overview of the trends that have made Hadoop and cloud computing such important technologies today.

Chapter 2, Getting Hadoop Up and Running, walks you through the initial setup of a local Hadoop cluster and the running of some demo jobs. For comparison, the same work is also executed on the hosted Hadoop Amazon service.

Chapter 3, Understanding MapReduce, goes inside the workings of Hadoop to show how MapReduce jobs are executed and shows how to write applications using the Java API.

Chapter 4, Developing MapReduce Programs, takes a case study of a moderately sized data set to demonstrate techniques to help when deciding how to approach the processing and analysis of a new data source.

Chapter 5, Advanced MapReduce Techniques, looks at a few more sophisticated ways of applying MapReduce to problems that don't necessarily seem immediately applicable to the Hadoop processing model.

Chapter 6, When Things Break, examines Hadoop's much-vaunted high availability and fault tolerance in some detail and tests just how good it is by deliberately causing havoc: killing processes and feeding in corrupt data.

Chapter 7, Keeping Things Running, takes a more operational view of Hadoop and will be of most use for those who need to administer a Hadoop cluster. Along with demonstrating some best practices, it describes how to prepare for the worst operational disasters so you can sleep at night.

Chapter 8, A Relational View on Data with Hive, introduces Apache Hive, which allows Hadoop data to be queried with a SQL-like syntax.

Chapter 9, Working with Relational Databases, explores how Hadoop can be integrated with existing databases, and in particular, how to move data from one to the other.

Chapter 10, Data Collection with Flume, shows how Apache Flume can be used to gather data from multiple sources and deliver it to destinations such as Hadoop.

Chapter 11, Where to Go Next, wraps up the book with an overview of the broader Hadoop ecosystem, highlighting other products and technologies of potential interest. In addition, it gives some ideas on how to get involved with the Hadoop community and get help.

What you need for this book

As we discuss the various Hadoop-related software packages used in this book, we will describe the particular requirements for each chapter. However, you will generally need somewhere to run your Hadoop cluster.

In the simplest case, a single Linux-based machine will give you a platform to explore almost all the exercises in this book. We assume you have a recent distribution of Ubuntu, but as long as you are familiar with the Linux command line, any modern distribution will suffice.

Some of the examples in later chapters really need multiple machines to see things working, so you will require access to at least four such hosts. Virtual machines are completely acceptable; they're not ideal for production but are fine for learning and exploration.

Since we also explore Amazon Web Services in this book, you can run all the examples on EC2 instances, and we will look at some other more Hadoop-specific uses of AWS throughout the book. AWS services are usable by anyone, but you will need a credit card to sign up!

Who this book is for

We assume you are reading this book because you want to know more about Hadoop at a hands-on level; the key audience is those with software development experience but no prior exposure to Hadoop or similar big data technologies.

For developers who want to know how to write MapReduce applications, we assume you are comfortable writing Java programs and are familiar with the Unix command-line interface. We will also show you a few programs in Ruby, but these are usually only to demonstrate language independence, and you don't need to be a Ruby expert.

For architects and system administrators, the book also provides significant value in explaining how Hadoop works, its place in the broader architecture, and how it can be managed operationally. Some of the more involved techniques in Chapter 4, Developing MapReduce Programs, and Chapter 5, Advanced MapReduce Techniques, are probably of less direct interest to this audience.

Conventions

In this book, you will find several headings appearing frequently.

To give clear instructions on how to complete a procedure or task, we use:

Time for action – heading

Action 1
Action 2
Action 3

Instructions often need some extra explanation so that they make sense, so they are followed with:

What just happened?

This heading explains the working of tasks or instructions that you have just completed.

You will also find some other learning aids in the book, including:

Pop quiz – heading

These are short multiple-choice questions intended to help you test your own understanding.

Have a go hero – heading

These set practical challenges and give you ideas for experimenting with what you have learned.

You will also find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: "You may notice that we used the Unix command rm to remove the Drush directory rather than the DOS del command."

A block of code is set as follows:

# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300

Any command-line input or output is written as follows:

cd /ProgramData/Propeople
rm -r Drush
git clone --branch master http://git.drupal.org/project/drush.git

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "On the Select Destination Location screen, click on Next to accept the default destination."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
