This figure shows how far we have built out our Data Lake by the end of Part 2 of this book:
Figure 01: Data Lake implemented so far in this book
HDFS | Distributed File Storage |
MapReduce | Batch Processing Engine |
YARN | Resource Negotiator |
HBase | Column-oriented, Key-Value NoSQL database that runs on HDFS |
Hive | Query engine that provides SQL-like access to data in HDFS |
Impala | Fast Query Engine for analytical queries on HDFS |
Sqoop | Data Acquisition and Ingestion |
Flume | Data Acquisition and Ingestion via streamed Flume events |
Kafka | Highly Scalable Distributed Messaging Engine |
Flink | All-purpose Real-Time data processing and ingestion, with Batch support |
Spark | All-purpose Fast Batch Processing and ingestion, with support for real-time processing via micro-batches |
Elasticsearch | Fast Distributed Indexing Engine built on Lucene, also used as a document-based NoSQL data store |
By this point, data in your Data Lake will have flowed from various source systems, through the various Data Lake components, and been persisted. You...