Hadoop's HDFS and YARN are the core components of the next-generation Data Lake, but several other components must be built to realize the vision. This section covers the core capabilities required to enable an Enterprise Data Lake. The following are the key components that need to be built for an effective Data Lake:
Let us look into each component in detail.
A Data Lake based on HDFS has a scalable, distributed filesystem, so it needs an equally scalable ingestion framework: software that can take in structured, unstructured, and streaming data.
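To make the three kinds of input concrete, the following is a minimal sketch of an ingestion layer, not a production framework. It assumes a local directory as a stand-in for an HDFS landing zone, and the function names (`ingest_structured`, `ingest_unstructured`, `ingest_stream`) and the directory layout are hypothetical, chosen only for illustration. Structured records are written as JSON lines, unstructured payloads are stored byte for byte, and streaming events are flushed in small micro-batches.

```python
import json
from pathlib import Path
from typing import Iterable, List


def ingest_structured(rows: Iterable[dict], landing: Path, name: str) -> Path:
    """Write tabular records as JSON lines under <landing>/structured/."""
    target = landing / "structured" / f"{name}.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row) + "\n")
    return target


def ingest_unstructured(payload: bytes, landing: Path, name: str) -> Path:
    """Store an opaque payload (document, image, log file) unchanged."""
    target = landing / "unstructured" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def ingest_stream(events: Iterable[str], landing: Path, name: str,
                  batch_size: int = 2) -> List[Path]:
    """Group streaming events into micro-batches, one file per batch."""
    out_dir = landing / "streaming" / name
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    batch: List[str] = []

    def flush() -> None:
        # Write the pending events to a numbered batch file, then reset.
        if batch:
            path = out_dir / f"batch-{len(written):05d}.txt"
            path.write_text("\n".join(batch) + "\n", encoding="utf-8")
            written.append(path)
            batch.clear()

    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            flush()
    flush()  # write any trailing partial batch
    return written
```

Each function takes the landing directory as a `Path` and returns the path or paths it wrote. A real deployment would replace the local writes with an HDFS client and a dedicated streaming system, but the separation of the three intake paths would look similar.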
A managed Data Lake requires data to be well organized, which in turn requires several kinds of metadata. The following are the key kinds of metadata that need to be managed: