Gobblin is a universal data ingestion framework for the extraction, transformation, and loading (ETL) of large volumes of data from a variety of data sources, such as files and databases, onto Hadoop.
Gobblin also handles the routine operations that ETL requires, such as job/task scheduling, state management, task partitioning, error handling, data quality checking, and data publishing.
Some features that make Gobblin particularly attractive are auto-scalability, extensibility, fault tolerance, data quality assurance, and the ability to handle data model evolution.
For this recipe, you need a Kafka cluster up and running, as well as a running HDFS cluster into which the data will be written.
Gobblin must also be installed; follow the instructions at http://gobblin.readthedocs.io/en/latest/Getting-Started.
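Once the prerequisites are in place, a Gobblin job is driven by a properties file (a `.pull` file). The sketch below shows what a minimal Kafka-to-HDFS job configuration might look like, modeled on the Kafka quickstart described in the Gobblin documentation; the broker address, HDFS URI, topic whitelist, and directory paths are placeholder assumptions you must adapt to your own clusters.

```
# Job identity
job.name=KafkaToHdfsQuickStart
job.group=GobblinKafka
job.description=Pull messages from Kafka topics and publish them to HDFS

# Kafka source (assumption: broker on localhost:9092; adjust to your cluster)
kafka.brokers=localhost:9092
source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka
# Placeholder topic name; replace with the topics you want to ingest
topic.whitelist=test_topic
# Start from the earliest available offset on the first run
bootstrap.with.offset=earliest

# Writer: land records as plain text files on HDFS
writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt

# Publisher moves completed files from the working dir to the final dir
data.publisher.type=gobblin.publisher.BaseDataPublisher

# HDFS locations (assumption: NameNode on localhost:9000; adjust as needed)
fs.uri=hdfs://localhost:9000
writer.fs.uri=hdfs://localhost:9000
state.store.fs.uri=hdfs://localhost:9000
mr.job.root.dir=/gobblin-kafka/working-dir
state.store.dir=/gobblin-kafka/state-store
data.publisher.final.dir=/gobblin-kafka/job-output
```

With a file like this saved, for example, as `kafka-to-hdfs.pull`, the job can be launched through Gobblin's standalone or MapReduce launcher scripts as described in the Getting-Started guide; on each run, Gobblin records the consumed offsets in the state store so the next run resumes where the previous one left off.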