Chapter 7. Building a Robust and Fault-Tolerant Data Collection System
In the previous chapters, we looked at how to create a data pipeline for our data-intensive system. Taking RabbitMQ as an example, we discussed the message integration patterns that should be taken into account when developing such a component, and we saw how to make the pipeline both highly available and distributed so that it can handle high loads.
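Before moving on, here is a minimal sketch of the hand-off point between such a pipeline and the collection layer this chapter introduces: a Python consumer draining a RabbitMQ queue using the third-party pika client. The host, the queue name `events`, and the `handle` callback are illustrative assumptions, not code from the previous chapter:

```python
import pika  # third-party RabbitMQ client (pip install pika)

# Connect to a local broker; the host and queue name are placeholders.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="events", durable=True)

def handle(ch, method, properties, body):
    # In a real collector, this is where an event would be normalized
    # and forwarded to downstream storage or indexing.
    print(f"collected: {body.decode('utf-8')}")
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="events", on_message_callback=handle)
channel.start_consuming()
```

The tools covered in this chapter take over precisely this kind of plumbing, replacing hand-written consumers with configurable, fault-tolerant agents.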
In this chapter, we will build on the knowledge gained there and discuss the various open source technologies that help us collect data from disparate sources.
In this chapter, we will focus on:
- Apache Flume
- Apache Sqoop
- The Elastic Stack (Elasticsearch, Logstash, Beats, and Kibana), better known as the ELK Stack
- Apache NiFi
If you recall, in Chapter 5, Understanding Data Collection and Normalization Requirements and Techniques, we looked at the design and possible implementation of a data collection...