Apache Flume: Distributed Log Collection for Hadoop

By: Steven Hoffman
Flume 0.9


Flume was first introduced in Cloudera's CDH3 distribution in 2011. It consisted of a federation of worker daemons (agents) configured from a centralized master (or masters) via ZooKeeper (a distributed configuration and coordination service). From the master, you could check agent status in a web UI, as well as push out configuration centrally from the UI or via a command-line shell (both actually communicating with the worker agents via ZooKeeper).

Data could be sent in one of three modes: Best Effort (BE), Disk Failover (DFO), and End-to-End (E2E). The master was required for E2E-mode acknowledgements, and since multimaster configuration never really matured, you usually had only one master, making it a central point of failure for E2E data flows. BE mode is just what it sounds like: the agent would try to send the data, but if it couldn't, the data was discarded. This mode is fine for things such as metrics, where gaps are easily tolerated because new data is only a second away. DFO mode stored undeliverable data on the local disk (or sometimes in a local database) and kept retrying until the data could be delivered to the next recipient in the data flow. This is handy for planned (or unplanned) outages, as long as you have sufficient local disk space to buffer the load.
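As a period piece, the following sketch shows roughly how a flow was configured from the 0.9 master's command-line shell. The node name, log path, collector host, and port here are hypothetical placeholders; in that era, the reliability mode was selected by choosing among the agentBESink, agentDFOSink, and agentE2ESink sink types:

    # Connect the Flume 0.9 shell to the master, then map a logical node to a
    # source and a sink; swapping the sink name switched the delivery mode:
    #   agentBESink  -> Best Effort
    #   agentDFOSink -> Disk Failover
    #   agentE2ESink -> End-to-End
    $ flume shell -c masterhost
    exec config agent1 'tail("/var/log/app.log")' 'agentE2ESink("collectorhost", 35853)'

The master pushed this mapping out to the named agent via ZooKeeper, so no per-machine configuration files needed editing.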

In June 2011, Cloudera moved control of the Flume project to the Apache Software Foundation. It graduated from incubator status a year later, in 2012. During the incubation year, work had already begun to refactor Flume under the Star Trek-themed tag Flume-NG (Flume the Next Generation).