Apache Flume: Distributed Log Collection for Hadoop

By: Steven Hoffman

Apache Flume: Distributed Log Collection for Hadoop Second Edition

Summary

In this chapter, we discussed the problem that Flume is attempting to solve: getting data into your Hadoop cluster for processing in an easily configured, reliable way. We also discussed the Flume agent and its logical components, including events, sources, channel selectors, channels, sink processors, and sinks. Finally, we briefly discussed Morphlines, a powerful new ETL (Extract, Transform, Load) library available starting with Version 1.4 of Flume.

The next chapter will cover these components in more detail, focusing on the most commonly used implementations of each. Like all good open source projects, almost all of these components are extensible if the bundled ones don't do what you need them to do.
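
To make the relationship between the agent's logical components concrete, here is a minimal sketch of a Flume agent configuration that wires one source to one sink through one channel. The agent name (agent), the component names (s1, c1, k1), and the port number are illustrative assumptions, not values taken from this chapter. The netcat source, memory channel, and logger sink are bundled Flume component types chosen only to keep the example small.

```properties
# Name the components that make up this agent (names are arbitrary labels).
agent.sources = s1
agent.channels = c1
agent.sinks = k1

# Source: listens on a TCP port and turns each received line into an event.
agent.sources.s1.type = netcat
agent.sources.s1.bind = 0.0.0.0
agent.sources.s1.port = 12345
# A source can write to one or more channels (note the plural key).
agent.sources.s1.channels = c1
# Channel selector: replicating is the default and is shown only for clarity.
agent.sources.s1.selector.type = replicating

# Channel: buffers events in memory between the source and the sink.
agent.channels.c1.type = memory

# Sink: reads from exactly one channel (note the singular key) and logs events.
agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1
```

Events flow from the source into the channel and are then drained by the sink. Swapping the sink type for an HDFS sink would be the usual step toward the Hadoop use case described above, but that requires additional properties not shown here.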