The job of the HDFS sink is to continuously open a file in HDFS, stream data into it, and at some point, close that file and start a new one. As we discussed in Chapter 1, Overview and Architecture, the time between files rotations must be balanced with how quickly files are closed in HDFS, thus making the data visible for processing. As we've discussed, having lots of tiny files for input will make your MapReduce jobs inefficient.
To use the HDFS sink, set the
type parameter on your named sink to
This defines a HDFS sink named
k1 for the agent named
agent. There are some additional parameters you must specify, starting with the path in HDFS you want to write the data to:
This HDFS path, like most file paths in Hadoop, can be specified in three different ways: absolute, absolute with server name, and relative. These are all equivalent (assuming your Flume agent is run as the