Apache Flume: Distributed Log Collection for Hadoop

If you have used any of the Flume 0.9 releases, you'll notice that the TailSource is no longer part of Flume. TailSource provided a mechanism to "tail" (http://en.wikipedia.org/wiki/Tail_(Unix)) any file on the system and create a Flume event for each line of the file. It could also handle file rotations, so many users treated the filesystem as a handoff point between the application creating the data (for instance, log4j) and the mechanism responsible for moving those files someplace else (for instance, syslog).

Sources add events to a channel, and sinks remove them, as part of a transaction. When you are tailing a file, there is no way to participate properly in a transaction. If a write to the channel failed, or if the channel was simply full (a more likely event than failure), the data couldn't be "put back" as rollback semantics dictate.
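
The rollback problem can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Flume's actual API: ToyChannel, ChannelFullError, and tail_once are hypothetical names invented for the illustration. The channel accepts a batch all-or-nothing, and the "tail" reader is a forward-only iterator, so when a put is rejected the lines it already consumed cannot be read again.

```python
from collections import deque
from itertools import islice


class ChannelFullError(Exception):
    """Raised when a batch would exceed the channel's capacity."""


class ToyChannel:
    """Hypothetical bounded queue with all-or-nothing batch puts (not Flume's API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put_batch(self, batch):
        # Transaction-like put: commit the whole batch or reject all of it.
        if len(self.events) + len(batch) > self.capacity:
            raise ChannelFullError("channel full, batch rejected")
        self.events.extend(batch)


def tail_once(lines, channel, batch_size):
    """Feed lines to the channel the way a tail would: forward-only, no rewind."""
    reader = iter(lines)  # the read position only ever moves forward
    lost = []
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        try:
            channel.put_batch(batch)
        except ChannelFullError:
            # The channel rejected the batch, but the reader has already moved
            # past these lines, so there is nowhere to "put them back".
            lost.extend(batch)
    return lost


channel = ToyChannel(capacity=4)
lost = tail_once([f"line {i}" for i in range(1, 7)], channel, batch_size=2)
print(lost)  # ['line 5', 'line 6']
```

In this sketch the first two batches fit, the third is rejected because the channel is full, and its two lines are gone from the reader's point of view. A source that could rewind its input, or that only marked input as consumed after a successful commit, would not lose them.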

Furthermore, if the rate of data written to a file exceeds the rate Flume...

By: Steven Hoffman

The problem with using tail