HDInsight Essentials - Second Edition

By: Rajesh Nadipalli

Managing file metadata using HCatalog

Organizing data into directories by content and source provides the foundation for a well-managed Data Lake. Beyond file location, a managed Data Lake should also capture the key attributes and structure of each file. For example, for the sales table ingested into the Data Lake at data/stage/salesdb01/sales, the attributes are as follows:

Structure of the file: For example, fixed length, delimited, XML, JSON, sequence, or columnar (RC)
Fields/columns in the data file: For example, fiscal quarter, $amount
Data types of the fields: For example, integer, string, and double

Apache HCatalog provides a table management layer for HDFS-based filesystems. Its role is comparable to that of the information_schema views in SQL Server: it stores the format and structure information of each file so that tools can look it up instead of hard-coding it.
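To make the three attributes concrete, the following is a minimal Python sketch, not taken from the book, that records them for the sales data set and renders a Hive-style table definition of the kind HCatalog stores. The class name FileMetadata, the column names fiscal_quarter and amount, the comma delimiter, the TEXTFILE format, and the leading slash on the location are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FileMetadata:
    """Hypothetical record of the attributes captured for one data set."""
    table: str
    location: str                    # directory that holds the data files
    file_format: str                 # structure of the file, e.g. TEXTFILE or RCFILE
    delimiter: str                   # field separator for delimited text
    columns: List[Tuple[str, str]]   # (field name, data type) pairs

    def to_ddl(self) -> str:
        """Render the metadata as a Hive-style CREATE EXTERNAL TABLE statement."""
        cols = ",\n".join(f"  {name} {dtype}" for name, dtype in self.columns)
        return (
            f"CREATE EXTERNAL TABLE {self.table} (\n{cols}\n)\n"
            f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{self.delimiter}'\n"
            f"STORED AS {self.file_format}\n"
            f"LOCATION '{self.location}';"
        )


# Assumed values: only the table name and directory come from the text above.
sales = FileMetadata(
    table="sales",
    location="/data/stage/salesdb01/sales",
    file_format="TEXTFILE",
    delimiter=",",
    columns=[("fiscal_quarter", "STRING"), ("amount", "DOUBLE")],
)

ddl = sales.to_ddl()
print(ddl)
```

A statement like this could then be submitted through Hive or the HCatalog command line (for example, with hcat -e), after which the structure is registered once and shared rather than re-declared by every tool that reads the files.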

Key benefits

The following are the key benefits of using HCatalog:

Stores structural metadata of HDFS files in a shared metastore
Provides interface...