Apache Flume: Distributed Log Collection for Hadoop

By: Steven Hoffman

Apache Flume: Distributed Log Collection for Hadoop Second Edition

Chapter 7. Putting It All Together

Now that we've walked through all the components and configurations, let's put together a working end-to-end configuration. This example is by no means exhaustive, and it won't cover every scenario you might need, but it should cover two common use cases I've seen over and over:

Finding errors by searching logs across multiple servers in near real time
Streaming data to HDFS for long-term batch processing

In the first scenario, your systems may be impaired, and you have multiple places where you need to search for problems. Bringing all of those logs to a single searchable place means getting your systems restored quickly. In the second scenario, you are interested in capturing data over the long term for analytics and machine learning.