HDInsight Essentials - Second Edition

By: Rajesh Nadipalli

Managing file metadata using HCatalog


Organizing data into specific directories based on content and source provides the foundation for a well-managed Data Lake. In addition to file location, a managed Data Lake should capture the key attributes and structure of each file. For example, for the sales table being ingested into the Data Lake at data/stage/salesdb01/sales, the attributes are as follows:

  • Structure of the file: For example, fixed length, delimited, XML, JSON, sequence, and columnar (RC)

  • Fields/columns in the data file: For example, fiscal quarter, $amount

  • Data types of the fields: For example, integer, string, and double
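
As a rough illustration of the attributes listed above, the following Python sketch models the metadata a managed Data Lake might record for the sales files. The class name FileMetadata, the column names fiscal_quarter and amount, the "delimited" format, and the chosen data types are assumptions made for this example; they are not part of HCatalog or of the original dataset description.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FileMetadata:
    """Hypothetical record of the attributes tracked for one ingested dataset."""
    location: str       # directory that holds the data files
    file_format: str    # e.g. "delimited", "fixed length", "JSON"
    columns: List[Tuple[str, str]] = field(default_factory=list)  # (name, data type)

    def column_names(self) -> List[str]:
        """Return the field names in file order."""
        return [name for name, _ in self.columns]

    def type_of(self, column: str) -> str:
        """Return the data type recorded for a field, or raise KeyError."""
        for name, dtype in self.columns:
            if name == column:
                return dtype
        raise KeyError(column)


# Example values: the location comes from the text; format and types are assumed.
sales = FileMetadata(
    location="data/stage/salesdb01/sales",
    file_format="delimited",
    columns=[("fiscal_quarter", "string"), ("amount", "double")],
)

print(sales.column_names())
print(sales.type_of("amount"))
```

Keeping the location, format, and column definitions together in one record is what lets downstream tools read the files without hard-coding their layout.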

Apache HCatalog provides a table management layer for data stored in an HDFS-based filesystem. It plays a role comparable to the INFORMATION_SCHEMA views in SQL Server: HCatalog stores the format and structure information for each file, such as the attributes listed above.
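
HCatalog tables are defined with Hive-style DDL, so one way to register the sales files is a CREATE EXTERNAL TABLE statement. The Python sketch below only assembles such a statement as a string; it does not connect to a cluster. The helper name build_create_table, the column names and types, the comma delimiter, and the leading slash on the location are assumptions for illustration.

```python
from typing import List, Tuple


def build_create_table(
    table: str,
    columns: List[Tuple[str, str]],
    location: str,
    delimiter: str = ",",
) -> str:
    """Assemble a Hive-style DDL statement for a delimited text dataset."""
    cols = ",\n  ".join(f"{name} {dtype.upper()}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Assumed columns and an assumed absolute path for the sales directory.
ddl = build_create_table(
    "sales",
    [("fiscal_quarter", "string"), ("amount", "double")],
    "/data/stage/salesdb01/sales",
)
print(ddl)
```

A statement like this could then be submitted through the Hive or HCatalog command line; the exact invocation depends on the cluster setup and is not shown here.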

Key benefits

The following are the key benefits of using HCatalog:

  • Stores structural metadata of HDFS files in a shared metastore

  • Provides interface...