Overview and Architecture | Apache Flume: Distributed Log Collection for Hadoop


Overview of this book

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant, with many failover and recovery mechanisms.

Apache Flume: Distributed Log Collection for Hadoop covers the problems with HDFS and streaming data/logs and shows how Flume can resolve them. The book explains the generalized architecture of Flume, including moving data to and from databases and NoSQL-style data stores, as well as optimizing performance. It also includes real-world scenarios of Flume implementation.

The book starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete installation and compilation of Flume and introduces how to use channels and channel selectors. For each architectural component (Sources, Channels, Sinks, Channel Processors, Sink Groups, and so on), the various implementations are covered in detail along with their configuration options, so you can customize Flume to your specific needs. Pointers on writing custom implementations are also included. By the end, you should be able to construct a series of Flume agents to transport your streaming data and logs from your systems into Hadoop in near real time.

Apache Flume: Distributed Log Collection for Hadoop

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Overview and Architecture

Flume 0.9

Flume 1.X (Flume-NG)

The problem with HDFS and streaming data/logs

Sources, channels, and sinks

Flume events

Summary

Flume Quick Start

Downloading Flume

Flume configuration file overview

Starting up with "Hello World"

Summary

Channels

Memory channel

File channel

Summary

Sinks and Sink Processors

HDFS sink

Compression codecs

Event serializers

Sink groups

Summary

Sources and Channel Selectors

The problem with using tail

The exec source

The spooling directory source

Syslog sources

Channel selectors

Summary

Interceptors, ETL, and Routing

Interceptors

Tiering data flows

Routing

Summary

Monitoring Flume

Monitoring the agent process

Monitoring performance metrics

Summary

There Is No Spoon – The Realities of Real-time Distributed Data Collection

Transport time versus log time

Time zones are evil

Capacity planning

Considerations for multiple data centers

Compliance and data expiry

Summary

Index

Chapter 1. Overview and Architecture
