Apache Flume: Distributed Log Collection for Hadoop

One of the new technologies incorporated in Flume, starting with Version 1.4, is something called a Morphline. You can think of a Morphline as a series of commands chained together to form a data transformation pipe.

If you are a fan of pipelining Unix commands, this will be very familiar to you. The commands themselves are intended to be small, single-purpose functions that when chained together create powerful logic. In many ways, using a Morphline command chain can be identical in functionality to the interceptor paradigm just mentioned. There is a Morphline interceptor we will cover in Chapter 6, Interceptors, ETL, and Routing, which you can use instead of, or in addition to, the included Java-based interceptors.

Note

To get an idea of how useful these commands can be, take a look at the handy grok command and its included extensible regular expression library at https://github.com/kite-sdk/kite/blob/master/kite-morphlines/kite-morphlines-core/src/test/resources/grok-dictionaries/grok-patterns

Many of the custom Java interceptors that I've written in the past were to modify the body (data) and can easily be replaced with an out-of-the-box Morphline command chain. You can get familiar with the Morphline commands by checking out their reference guide at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

Flume Version 1.4 also includes a Morphline-backed sink used primarily to feed data into Solr. We'll see more of this in Chapter 4, Sinks and Sink Processors, Morphline Solr Search Sink.

Morphlines are just one component of the KiteSDK included in Flume. Starting with Version 1.5, Flume has added experimental support for KiteData, which is an effort to create a standard library for datasets in Hadoop. It looks very promising, but it is outside the scope of this book.

Note

Please see the project home page for more information, as it will certainly become more prominent in the Hadoop ecosystem as the technology matures. You can read all about the KiteSDK at http://kitesdk.org.

Apache Flume: Distributed Log Collection for Hadoop

By : Steven Hoffman

Apache Flume: Distributed Log Collection for Hadoop

By: Steven Hoffman

Overview of this book

Related Content you might be interested in

Current Title:

Apache Flume: Distributed Log Collection for Hadoop

The Kite SDK

Note

Note