Apache Flume: Distributed Log Collection for Hadoop

When people speak of Hadoop, they usually refer to storing lots of data for a long time, usually in HDFS, so more interesting data science or machine learning can be done later. Let's extend our use case by splitting the data flow at the collector to store an extra copy in HDFS for later use.

So, back in Amazon AWS, I start a fourth server to run Hadoop. If you plan on doing all your work in Hadoop, you'll probably want to write this data to S3, but for this example, let's stick with HDFS. Now our server diagram looks like this:

I used Cloudera's one-line installation instructions to speed up the setup. It's instructions can be found at http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_qs_mrv1_pseudo.html.

Since the Amazon AMI is compatible with Enterprise Linux 6, I selected the EL6 RPM Repository and imported the corresponding GPG key:

[ec2-user@ip-172-31-7-175 ~]$ sudo rpm -ivh http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64...

Apache Flume: Distributed Log Collection for Hadoop

By : Steven Hoffman

Apache Flume: Distributed Log Collection for Hadoop

By: Steven Hoffman

Overview of this book

Related Content you might be interested in

Current Title:

Apache Flume: Distributed Log Collection for Hadoop

Archiving to HDFS