This section discusses community contributions that integrate Apache Kafka with other systems for needs such as logging, packaging, cloud deployment, and Hadoop integration.
One notable contribution is Camus (https://github.com/linkedin/camus), which provides a pipeline from Kafka to HDFS. In this project, a single MapReduce job performs the following steps to load data into HDFS in a distributed manner:
As a first step, it discovers the latest topics and partition offsets from ZooKeeper.
Each task in the MapReduce job then fetches events from the Kafka broker and commits the pulled data, along with an audit count, to the output folders.
After the job completes, the final offsets are written to HDFS, where subsequent MapReduce jobs can consume them.
Information about the consumed messages is also updated in the Kafka cluster.
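The steps above can be illustrated with a minimal sketch. This is not the actual Camus code or API; every name below (`discover_offsets`, `fetch_events`, `run_job`, the in-memory `zk_state`, `broker`, and `hdfs` dictionaries) is a hypothetical stand-in that models the flow of one job run.

```python
# Hypothetical sketch of a Camus-style Kafka-to-HDFS load; dictionaries
# stand in for ZooKeeper, the Kafka broker, and HDFS.

def discover_offsets(zk_state):
    """Step 1: read the latest topic/partition offsets (ZooKeeper stand-in)."""
    return dict(zk_state)

def fetch_events(broker, topic, partition, start_offset):
    """Step 2: one map task pulls events for a single partition,
    starting from the last committed offset."""
    events = broker.get((topic, partition), [])[start_offset:]
    return events, start_offset + len(events)

def run_job(zk_state, broker, hdfs):
    """Steps 3-4: commit pulled data plus audit counts to output folders,
    then write the final offsets for subsequent jobs to consume."""
    final_offsets = {}
    for (topic, partition), start in discover_offsets(zk_state).items():
        events, end = fetch_events(broker, topic, partition, start)
        hdfs[f"/data/{topic}/{partition}"] = events        # pulled data
        hdfs[f"/audit/{topic}/{partition}"] = len(events)  # audit count
        final_offsets[(topic, partition)] = end
    hdfs["/offsets"] = final_offsets                       # for the next run
    return final_offsets
```

In this sketch, committing the final offsets alongside the data is what lets the next run resume where the previous one stopped, mirroring the incremental loading the job performs.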
Some other useful contributions are:
Automated deployment and configuration of Kafka and ZooKeeper on Amazon (https://github.com/nathanmarz/kafka-deploy...