Apache Flume: Distributed Log Collection for Hadoop, Second Edition

By: Steven Hoffman


Preface

Hadoop is a great open source tool for shifting tons of unstructured data into something manageable so that your business can gain better insight into your customers' needs. It's cheap (mostly free), scales horizontally as long as you have space and power in your datacenter, and can handle problems that would crush your traditional data warehouse. That said, a little-known secret is that your Hadoop cluster requires you to feed it data. Otherwise, you just have a very expensive heat generator! You will quickly realize (once you get past the "playing around" phase with Hadoop) that you will need a tool to automatically feed data into your cluster. In the past, you had to come up with a solution for this problem yourself, but no more! Flume was started as a project at Cloudera, whose integration engineers kept writing the same data-import tools over and over again for their customers. Today, the project lives at the Apache Foundation, is under active development, and boasts users who have been running it in their production environments for years.

In this book, I hope to get you up and running quickly with an architectural overview of Flume and a quick-start guide. After that, we'll dive deep into the details of many of the more useful Flume components, including the very important file channel for the persistence of in-flight data records and the HDFS Sink for buffering and writing data into HDFS (the Hadoop Distributed File System). Since Flume comes with a wide variety of modules, chances are that the only tool you'll need to get started is a text editor for the configuration file.

By the time you reach the end of this book, you should know enough to build a highly available, fault-tolerant, streaming data pipeline that feeds your Hadoop cluster.

What this book covers

Chapter 1, Overview and Architecture, introduces Flume and the problem space that it's trying to address (specifically with regard to Hadoop), along with an architectural overview of the components covered in later chapters.

Chapter 2, A Quick Start Guide to Flume, serves to get you up and running quickly. It includes downloading Flume, creating a "Hello, World!" configuration, and running it.
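As a taste of what that chapter builds, a minimal single-agent configuration might look like the following sketch. The agent and component names (a1, r1, c1, k1) and the port number are illustrative placeholders, not the book's exact example:

a1.sources=r1
a1.channels=c1
a1.sinks=k1
# A netcat source listening on a local port
a1.sources.r1.type=netcat
a1.sources.r1.bind=localhost
a1.sources.r1.port=44444
a1.sources.r1.channels=c1
# An in-memory channel connecting the source to the sink
a1.channels.c1.type=memory
# A logger sink that prints events for inspection
a1.sinks.k1.type=logger
a1.sinks.k1.channel=c1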

Chapter 3, Channels, covers the two major channels most people will use and the configuration options available for each of them.

Chapter 4, Sinks and Sink Processors, goes into great detail on using the HDFS Flume output, including compression options and options for formatting the data. Failover options are also covered so that you can create a more robust data pipeline.

Chapter 5, Sources and Channel Selectors, introduces several of the Flume input mechanisms and their configuration options. Also covered is switching between different channels based on data content, which allows the creation of complex data flows.

Chapter 6, Interceptors, ETL, and Routing, explains how to transform data in-flight as well as extract information from the payload to use with Channel Selectors to make routing decisions. Then this chapter covers tiering Flume agents using Avro serialization, as well as using the Flume command line as a standalone Avro client for testing and importing data manually.
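As a preview, invoking Flume as a standalone Avro client looks roughly like the following; the hostname, port, and filename here are placeholders, not values from the book:

$ bin/flume-ng avro-client --conf conf --host collector.example.com --port 4141 --filename /var/log/app/events.log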

Chapter 7, Putting It All Together, walks you through the details of an end-to-end use case from the web server logs to a searchable UI, backed by Elasticsearch as well as archival storage in HDFS.

Chapter 8, Monitoring Flume, discusses various options available for monitoring Flume both internally and externally, including Monit, Nagios, Ganglia, and custom hooks.

Chapter 9, There Is No Spoon – the Realities of Real-time Distributed Data Collection, is a collection of miscellaneous things to consider that are outside the scope of just configuring and using Flume.

What you need for this book

You'll need a computer with a Java Virtual Machine installed, since Flume is written in Java. If you don't have Java on your computer, you can download it from http://java.com/.

You will also need an Internet connection so that you can download Flume to run the Quick Start example.

This book covers Apache Flume 1.5.2.

Who this book is for

This book is for people responsible for implementing the automatic movement of data from various systems to a Hadoop cluster. If it is your job to load data into Hadoop on a regular basis, this book should help you code your way out of manual monkey work, or spare you from writing a custom tool that you'll be supporting for as long as you work at your company.

Only basic knowledge of Hadoop and HDFS is required. Some custom implementations are covered, should your needs necessitate them. For this level of implementation, you will need to know how to program in Java.

Finally, you'll need your favorite text editor, since most of this book covers how to configure various Flume components via an agent's text configuration file.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and explanations of their meanings.

Code words in text are shown as follows: "If you want to use this feature, you set the useDualCheckpoints property to true and specify a location for that second checkpoint directory with the backupCheckpointDir property."
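To illustrate, those dual-checkpoint settings might appear in an agent's file channel configuration like this (the channel name c1 and the directory paths are illustrative only):

agent.channels.c1.type=file
agent.channels.c1.checkpointDir=/flume/checkpoint
# Enable a second checkpoint copy in a separate directory
agent.channels.c1.useDualCheckpoints=true
agent.channels.c1.backupCheckpointDir=/flume/checkpoint-backup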

A block of code is set as follows:

agent.sinks.k1.hdfs.path=/logs/apache/access
agent.sinks.k1.hdfs.filePrefix=access
agent.sinks.k1.hdfs.fileSuffix=.log

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

agent.sources.s1.command=uptime
agent.sources.s1.restart=true
agent.sources.s1.restartThrottle=60000

Any command-line input or output is written as follows:

$ tar -zxf apache-flume-1.5.2.tar.gz
$ cd apache-flume-1.5.2

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Flume was first introduced in Cloudera's CDH3 distribution in 2011."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback helps us develop titles that you will really get the most out of.

To send us general feedback, simply send an e-mail to , and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at if you are having a problem with any aspect of the book, and we will do our best to address it.