HDInsight Essentials - Second Edition

By: Rajesh Nadipalli

Overview of this book

HDInsight Essentials Second Edition

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Hadoop and HDInsight in a Heartbeat

Data is everywhere

Hadoop concepts

Hadoop distributions

HDInsight overview

Hadoop on Windows deployment options

Enterprise Data Lake using HDInsight

Enterprise Data Warehouse architecture

The next generation Hadoop-based Enterprise data architecture

Journey to your Data Lake dream

Tools and technology for Hadoop ecosystem

Use case powered by Microsoft HDInsight

HDInsight Service on Azure

Registering for an Azure account

Provisioning an HDInsight cluster

HDInsight management dashboard

Exploring clusters using the remote desktop

Deleting the cluster

HDInsight Emulator for the development

Administering Your HDInsight Cluster

Monitoring cluster health

Name Node status

Hadoop Service Availability

YARN Application Status

Azure storage management

Azure PowerShell

Ingest and Organize Data Lake

End-to-end Data Lake solution

Ingesting to Data Lake using HDFS command

Loading data to Azure Blob storage using Azure PowerShell

Loading files to Data Lake using GUI tools

Using Sqoop to move data from RDBMS to Data Lake

Organizing your Data Lake in HDFS

Managing file metadata using HCatalog

Transform Data in the Data Lake

Transformation overview

Tools for transforming data in Data Lake

Transformation for the OTP project

Other tools used for transformation

Analyze and Report from Data Lake

Data access overview

Analysis using Excel and Microsoft Hive ODBC driver

Analysis using Excel Power Query

Other BI features in Excel

Ad hoc analysis using Hive

Other alternatives for analysis

HDInsight 3.1 New Features

Strategy for a Successful Data Lake Implementation

Challenges on building a production Data Lake

The success path for a production Data Lake

Architectural considerations

Online resources

Index

Organizing your Data Lake in HDFS

As you load files into your Data Lake, it is important to manage the process so that data consumers can find the right data. Organizing data requires planning, coordination, and governance. One model that I have seen used by several clients is to have three main directories:

Staging: This directory hosts all the original source files as they are ingested into the Data Lake. Each source should have its own directory. For example, consider an organization with two financial databases, findb01 and findb02. A proposed directory structure in the Data Lake is /data/stage/findb01 and /data/stage/findb02.
Cleansed: The data in staging should go through basic audit and data quality checks to ensure that it meets the organization's standards. For example, if sales data is being ingested into the Data Lake, the state and country codes in the sales records should be valid. The cleansed data should be grouped by subject area, for example, finance...
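
The staging layout and the code check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the /data/stage root comes from the example paths in the text, while the helper names (staging_dir, is_valid_sales_record), the record field names, and the sample code lists are hypothetical and would be replaced by the organization's own reference data.

```python
import posixpath

# Staging root taken from the example paths in the text.
STAGE_ROOT = "/data/stage"

# Hypothetical reference data; a real check would load the organization's own lists.
VALID_COUNTRIES = {"US", "CA"}
VALID_STATES = {"US": {"NC", "CA", "NY"}, "CA": {"ON", "QC"}}


def staging_dir(source_name):
    """Return the staging directory for one source system."""
    return posixpath.join(STAGE_ROOT, source_name)


def is_valid_sales_record(record):
    """Basic quality check: the country code and its state code must both be known."""
    country = record.get("country")
    state = record.get("state")
    return country in VALID_COUNTRIES and state in VALID_STATES.get(country, set())


if __name__ == "__main__":
    for source in ("findb01", "findb02"):
        print(staging_dir(source))
    print(is_valid_sales_record({"country": "US", "state": "NC"}))
    print(is_valid_sales_record({"country": "US", "state": "ZZ"}))
```

Keeping the path rule in one function means every ingestion job derives the same directory for a given source, and records that fail the check can stay in staging rather than being promoted.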