HDInsight Essentials - Second Edition

By: Rajesh Nadipalli
Journey to your Data Lake dream


Hadoop's HDFS and YARN are the core components of the next-generation Data Lake, but several other components must be built to realize the vision. This section covers the key capabilities that an effective Enterprise Data Lake requires.

Let us look into each component in detail.

Ingestion and organization

A Data Lake based on HDFS has a scalable, distributed filesystem. It requires an equally scalable ingestion framework and software that can take in structured, unstructured, and streaming data.

A managed Data Lake requires well-organized data, which in turn requires several kinds of metadata. The following are the key kinds of metadata that must be managed:

  • File inventory: What was ingested into Hadoop, when, and by whom?

  • Structural metadata: What is the structure of a file, for example XML, HL7, CSV, or TSV?

    Note

    Hadoop does work well with Avro sequence...
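
To make the two kinds of metadata above more concrete, here is a minimal sketch in Python. It is an illustration only, not the book's implementation: the `FileInventoryRecord` class, the `build_inventory_record` and `describe_delimited` helpers, and the sample path and user name are all hypothetical. It assumes the inventory is captured when the file is ingested and that structural metadata for a delimited file can be approximated from its header line.

```python
import csv
import getpass
import hashlib
import io
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FileInventoryRecord:
    """Hypothetical inventory entry: the what, when, and who of one ingested file."""
    path: str          # what: where the file landed
    size_bytes: int    # what: how large it is
    sha256: str        # what: content fingerprint, useful for spotting duplicates
    ingested_at: str   # when: UTC timestamp in ISO 8601 form
    ingested_by: str   # who: user or service account that ingested it


def build_inventory_record(path, content, user=None):
    """Build an inventory record for raw file bytes at ingestion time."""
    return FileInventoryRecord(
        path=path,
        size_bytes=len(content),
        sha256=hashlib.sha256(content).hexdigest(),
        ingested_at=datetime.now(timezone.utc).isoformat(),
        ingested_by=user or getpass.getuser(),
    )


def describe_delimited(text, candidates=(",", "\t")):
    """Guess structural metadata for CSV/TSV text from its header line.

    Assumption: the first line is a header, and the candidate delimiter
    that occurs most often in that line is the real one.
    """
    first_line = text.splitlines()[0]
    delimiter = max(candidates, key=first_line.count)
    header = next(csv.reader(io.StringIO(text), delimiter=delimiter))
    return {"format": "TSV" if delimiter == "\t" else "CSV", "columns": header}
```

For example, calling `build_inventory_record("/landing/patients.tsv", data, user="etl_svc")` yields a record that could be stored in a metadata catalog, and `describe_delimited` on the same file's text returns its format and column names. Formats such as XML or HL7 would need their own parsers; this sketch covers only delimited text.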