Organizing data in specific directories based on content and source provides the foundation for a well-managed Data Lake. In addition to file location, a managed Data Lake should capture the key attributes and structure of each file. For example, for the sales table ingested into the Data Lake at data/stage/salesdb01/sales, the attributes are as follows:
Structure of the file: For example, fixed length, delimited, XML, JSON, sequence, and columnar (RC)
Fields/columns in the data file: For example, fiscal quarter, $amount
Data types of the fields: For example, integer, string, and double
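The attributes listed above can be pictured as a small metadata record kept alongside the file location. The following is only an illustrative sketch: the class names `Column` and `FileMetadata`, the column names `fiscal_quarter` and `amount`, and the choice of a delimited layout with string and double types are assumptions made for the example, not details given for the sales table.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Column:
    """One field in the data file (hypothetical helper type)."""
    name: str
    data_type: str


@dataclass
class FileMetadata:
    """Location, structure, and column information for an ingested file."""
    location: str
    structure: str
    columns: List[Column] = field(default_factory=list)

    def schema(self) -> Dict[str, str]:
        # Map each column name to its declared data type.
        return {column.name: column.data_type for column in self.columns}


# Assumed values for illustration; only the location comes from the text.
sales_metadata = FileMetadata(
    location="data/stage/salesdb01/sales",
    structure="delimited",
    columns=[
        Column("fiscal_quarter", "string"),
        Column("amount", "double"),
    ],
)

print(sales_metadata.schema())
```

Keeping the structure and column types next to the location means a consumer can interpret the file without inspecting its contents.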
Apache HCatalog provides a table management system for HDFS-based filesystems. It provides the equivalent of the information_schema views of SQL Server, and it stores the format and structure information of the files.
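HCatalog is built on the Hive metastore, so format and structure information is typically registered with a Hive-style table definition, for example through Hive or the hcat command-line tool. The sketch below only builds such a DDL statement as a string and does not connect to any cluster. The function name `render_create_table`, the column names and types, and the comma delimiter are assumptions for illustration; the location is the staging path from the text, used as written.

```python
from typing import List, Tuple


def render_create_table(
    table: str,
    columns: List[Tuple[str, str]],
    location: str,
    delimiter: str = ",",
) -> str:
    """Build a Hive-style CREATE EXTERNAL TABLE statement for a delimited file."""
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({column_list}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"STORED AS TEXTFILE "
        f"LOCATION '{location}'"
    )


# Assumed columns and types; the sales table's real schema is not given.
ddl = render_create_table(
    table="sales",
    columns=[("fiscal_quarter", "STRING"), ("amount", "DOUBLE")],
    location="data/stage/salesdb01/sales",
)

print(ddl)
```

An external table fits the staging scenario because the metastore records only the schema and location, while the data files stay where the ingestion process put them.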