Apache Flume: Distributed Log Collection for Hadoop, Second Edition

By: Steve Hoffman

HDFS sink


The job of the HDFS sink is to continuously open a file in HDFS, stream data into it, and at some point close that file and start a new one. As we discussed in Chapter 1, Overview and Architecture, the time between file rotations must be balanced against how quickly files are closed in HDFS, since the data only becomes visible for processing once a file is closed. As we've also discussed, using lots of tiny files as input makes your MapReduce jobs inefficient.
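As a sketch of how this trade-off is tuned (the hdfs.roll* property names are standard Flume HDFS sink settings, but the values below are illustrative, not recommendations), you can tell the sink to roll files by time, size, or event count:

# Roll (close and reopen) the file every 300 seconds, or once it
# reaches roughly one HDFS block (128 MB), whichever comes first.
agent.sinks.k1.hdfs.rollInterval=300
agent.sinks.k1.hdfs.rollSize=134217728
# Set the event-count trigger to zero to disable count-based rolling,
# so only the time and size triggers apply.
agent.sinks.k1.hdfs.rollCount=0

Longer roll intervals mean fewer, larger files that are friendlier to MapReduce, at the cost of data staying invisible longer before the file is closed.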

To use the HDFS sink, set the type parameter on your named sink to hdfs.

agent.sinks.k1.type=hdfs

This defines an HDFS sink named k1 for the agent named agent. There are some additional parameters you must specify, starting with the path in HDFS you want the data written to:

agent.sinks.k1.hdfs.path=/path/in/hdfs
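Note that Flume will expand date and time escape sequences in this path, which lets you partition output into directories by time. A minimal illustration (the path itself is hypothetical; the %Y/%m/%d escapes are standard Flume behavior and require either a timestamp header on each event or hdfs.useLocalTimeStamp=true):

# Write events into a new directory per day, e.g. /logs/2014/11/05
agent.sinks.k1.hdfs.path=/logs/%Y/%m/%d
# Use the agent's local clock rather than a timestamp event header
agent.sinks.k1.hdfs.useLocalTimeStamp=true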

This HDFS path, like most file paths in Hadoop, can be specified in three different ways: absolute, absolute with server name, and relative. These are all equivalent (assuming your Flume agent is run as the flume user):

absolute

/Users/flume...