Apache Flume: Distributed Log Collection for Hadoop, Second Edition

By: Steve Hoffman

HDFS sink


The job of the HDFS sink is to continuously open a file in HDFS, stream data into it, and at some point close that file and start a new one. As we discussed in Chapter 1, Overview and Architecture, the time between file rotations must be balanced against how quickly files are closed in HDFS, since the data only becomes visible for processing once a file is closed. As we've also discussed, using lots of tiny files as input makes your MapReduce jobs inefficient.
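As a sketch of how this trade-off is tuned (the hdfs.roll* property names are standard Flume HDFS sink settings, but the values below are illustrative, not recommendations), you can tell the sink to roll files by time, size, or event count:

# Roll (close and reopen) the file every 300 seconds, or once it
# reaches roughly one HDFS block (128 MB), whichever comes first.
agent.sinks.k1.hdfs.rollInterval=300
agent.sinks.k1.hdfs.rollSize=134217728
# Set the event-count trigger to zero to disable count-based rolling,
# so only the time and size triggers apply.
agent.sinks.k1.hdfs.rollCount=0

Longer roll intervals mean fewer, larger files that are friendlier to MapReduce, at the cost of data staying invisible longer before the file is closed.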

To use the HDFS sink, set the type parameter on your named sink to hdfs.

agent.sinks.k1.type=hdfs

This defines an HDFS sink named k1 for the agent named agent. There are some additional parameters you must specify, starting with the path in HDFS you want the data written to:

agent.sinks.k1.hdfs.path=/path/in/hdfs
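Note that Flume will expand date and time escape sequences in this path, which lets you partition output into directories by time. A minimal illustration (the path itself is hypothetical; the %Y/%m/%d escapes are standard Flume behavior and require either a timestamp header on each event or hdfs.useLocalTimeStamp=true):

# Write events into a new directory per day, e.g. /logs/2014/11/05
agent.sinks.k1.hdfs.path=/logs/%Y/%m/%d
# Use the agent's local clock rather than a timestamp event header
agent.sinks.k1.hdfs.useLocalTimeStamp=true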

This HDFS path, like most file paths in Hadoop, can be specified in three different ways: absolute, absolute with server name, and relative. These are all equivalent (assuming your Flume agent is run as the flume user):

absolute

/Users/flume...