Book Image

Apache Flume: Distributed Log Collection for Hadoop

By : Steven Hoffman
Book Image

Apache Flume: Distributed Log Collection for Hadoop

By: Steven Hoffman

Overview of this book

Table of Contents (16 chapters)
Apache Flume: Distributed Log Collection for Hadoop Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

The problem with using tail


If you have used any of the Flume 0.9 releases, you'll notice that the TailSource is no longer a part of Flume. TailSource provided a mechanism to "tail" (http://en.wikipedia.org/wiki/Tail_(Unix)) any file on the system and create Flume events for each line of the file. It could also handle file rotations, so many used the filesystem as a handoff point between the application creating the data (for instance, log4j) and the mechanism responsible for moving those files someplace else (for instance, syslog).

As is the case with both channels and sinks, events are added and removed from a channel as part of a transaction. When you are tailing a file, there is no way to participate properly in a transaction. If failure to write successfully to a channel occurred, or if the channel was simply full (a more likely event than failure), the data couldn't be "put back" as rollback semantics dictate.

Furthermore, if the rate of data written to a file exceeds the rate Flume...