The problem with HDFS and streaming data/logs


HDFS isn't a real filesystem, at least not in the traditional sense, and many of the things we take for granted with normal filesystems don't apply here, such as being able to mount it. This makes getting your streaming data into Hadoop a little more complicated.

In a regular POSIX-style filesystem, if you open a file and write data, that data exists on disk even before the file is closed. That is, if another program opens the same file and starts reading, it will see whatever the writer has already flushed to disk. Furthermore, if the writing process is interrupted, any portion that made it to disk is usable (it may be incomplete, but it exists).
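
To make the contrast concrete, here is a minimal Java sketch showing this behavior on a local POSIX-style filesystem (the path /tmp/visibility-demo.log is purely hypothetical, and the "second reader" is simulated by a second stream in the same process):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;

public class PosixVisibility {
    public static void main(String[] args) throws Exception {
        // Hypothetical local path used only for illustration
        String path = "/tmp/visibility-demo.log";

        FileOutputStream writer = new FileOutputStream(path);
        writer.write("first batch of events\n".getBytes(StandardCharsets.UTF_8));
        writer.flush();   // flushed to disk, but the file is still open

        // A second reader can already see everything the writer has flushed,
        // even though the writer has not closed the file yet.
        try (FileInputStream reader = new FileInputStream(path)) {
            byte[] buf = new byte[1024];
            int n = reader.read(buf);
            System.out.println("Reader saw " + n + " bytes before close");
        }

        writer.close();
    }
}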

In HDFS, the file exists only as a directory entry; it shows zero length until the file is closed. This means that if data is written to a file for an extended period without closing it, a network disconnect with the client will leave you with nothing but an empty file for all your efforts. This may lead you to the conclusion that it would be wise to write small files so that you can close them as soon as possible.
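
The following sketch uses the standard HDFS FileSystem API against a hypothetical /tmp/streaming-demo.log path to illustrate this. It assumes fs.defaultFS in your configuration points at your cluster; on most setups, the length reported by the NameNode stays at zero while the writer holds the file open and only reflects the written bytes after close():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLengthDemo {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at your cluster; the path is hypothetical.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/streaming-demo.log");

        FSDataOutputStream out = fs.create(path);
        out.writeBytes("events written but not yet closed\n");
        out.hflush();   // pushes data to the datanode pipeline

        // While the file is still open, the NameNode typically reports
        // a zero length for the block being written.
        System.out.println("Length while open:  " + fs.getFileStatus(path).getLen());

        out.close();
        System.out.println("Length after close: " + fs.getFileStatus(path).getLen());
    }
}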

The problem is that Hadoop doesn't like lots of tiny files. As the HDFS filesystem metadata is kept in memory on the NameNode, the more files you create, the more RAM you'll need. From a MapReduce perspective, tiny files lead to poor efficiency. Usually, each Mapper is assigned a single block of a file as input (unless you have used certain compression codecs). If you have lots of tiny files, the cost of starting the worker processes can be disproportionately high compared to the amount of data they process. This kind of block fragmentation also results in more Mapper tasks, increasing the overall job run times.
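
As a rough, back-of-the-envelope illustration (the exact figures vary by Hadoop version and configuration), each file and each block is commonly estimated to occupy on the order of 150 bytes of NameNode heap. Ten million single-block files would therefore cost roughly 10,000,000 × (150 + 150) bytes, or about 3 GB of NameNode memory, before a single byte of payload is considered, and a MapReduce job over them would spawn roughly ten million map tasks. The same data rolled into a few thousand larger files needs only a few megabytes of metadata and a few thousand Mappers.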

These factors need to be weighed when determining the rotation period to use when writing to HDFS. If the plan is to keep the data around for a short time, then you can lean toward the smaller file size. However, if you plan on keeping the data for a very long time, you can either target larger files or do some periodic cleanup to compact smaller files into fewer, larger files to make them more MapReduce friendly. After all, you only ingest the data once, but you might run a MapReduce job on that data hundreds or thousands of times.
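
If you do go the compaction route, one simple (if blunt) option in the Hadoop 2.x API is FileUtil.copyMerge(), which concatenates every file in a directory into a single output file. The paths below are hypothetical, and the snippet is only a sketch of the idea rather than a production job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CompactSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: a directory of small rolled files and the
        // single compacted output file that replaces them.
        Path smallFilesDir = new Path("/logs/2014/12/01");
        Path compacted     = new Path("/logs/compact/2014-12-01.log");

        // Concatenate every file in the source directory into one HDFS file,
        // deleting the originals once the merge succeeds.
        FileUtil.copyMerge(fs, smallFilesDir, fs, compacted, true, conf, null);
    }
}

For very large volumes, you would more likely rewrite the data with a small MapReduce or Hive job, but the principle is the same: pay the compaction cost once so that every subsequent job reads a handful of large files instead of thousands of tiny ones.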