Data Lake for Enterprises

By: Vivek Mishra, Tomcy John, Pankaj Misra

Overview of this book

"Data Lake" has recently emerged as a prominent term in the big data industry. Data scientists can use a data lake to derive meaningful insights that businesses can use to redefine or transform the way they operate. Lambda architecture is also emerging as one of the prominent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable businesses to make critical decisions. This book tries to bring these two important aspects, data lake and lambda architecture, together.

This book is divided into three main sections. The first introduces you to the concept of data lakes and their importance in enterprises, and gets you up to speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and Elasticsearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use cases. It also shows you how other peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies using the lambda architectural patterns to build your enterprise data lake.

Title Page

Credits

Foreword

About the Authors

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

Part 1 - Overview

Part 2 - Technical Building blocks of Data Lake

Part 3 - Bringing It All Together

Introduction to Data

Exploring data

What is Enterprise Data?

Enterprise Data Management

Big data concepts

Relevance of data

Quality of data

Where does this data live in an enterprise?

Enterprise’s current state

Enterprise digital transformation

Data lake use case enlightenment

Summary

Comprehensive Concepts of a Data Lake

What is a Data Lake?

How does a Data Lake help enterprises?

How does a Data Lake work?

Differences between Data Lake and Data Warehouse

Approaches to building a Data Lake

Lambda Architecture-driven Data Lake

Summary

Lambda Architecture as a Pattern for Data Lake

What is Lambda Architecture?

History of Lambda Architecture

Principles of Lambda Architecture

Components of a Lambda Architecture

Complete working of a Lambda Architecture

Advantages of Lambda Architecture

Disadvantages of Lambda Architectures

Technology overview for Lambda Architecture

Applied lambda

Working examples of Lambda Architecture

Kappa architecture

Summary

Applied Lambda for Data Lake

Knowing Hadoop distributions

Selection factors for a big data stack for enterprises

Batch layer for data processing

Serving layer

Summary

Data Acquisition of Batch Data using Apache Sqoop

Context in data lake - data acquisition

Why Apache Sqoop

Workings of Sqoop

Sqoop connectors

Sqoop support for HDFS

Sqoop working example

When to use Sqoop

When not to use Sqoop

Real-time Sqooping: a possibility?

Other options

Summary

Data Acquisition of Stream Data using Apache Flume

Context in Data Lake: data acquisition

Why Flume?

Flume architecture principles

The Flume Architecture

Flume event - Stream Data

Flume transaction management

Other Flume components

Context Routing

Flume working example

When to use Flume

When not to use Flume

Other options

Summary

Messaging Layer using Apache Kafka

Context in Data Lake - messaging layer

Why Apache Kafka

Kafka architecture

Other Kafka components

Kafka programming interface

Producer and consumer reliability

Kafka security

Kafka as message-oriented middleware

Scale-out architecture with Kafka

Kafka Connect

Kafka working example

When to use Kafka

When not to use Kafka

Other options

Summary

Data Processing using Apache Flink

Context in a Data Lake - Data Ingestion Layer

Why Apache Flink?

Working of Flink

Flink APIs

Flink working example

When to use Flink

When not to use Flink

Other options

Summary

Data Store Using Apache Hadoop

Context for Data Lake - Data Storage and lambda Batch layer

Hadoop for near real-time applications

Hadoop deployment modes

Hadoop working examples

When not to use Hadoop

Other Hadoop Processing Options

Summary

Indexed Data Store using Elasticsearch

Context in Data Lake: data storage and lambda speed layer

What is Elasticsearch?

Why Elasticsearch

Working of Elasticsearch

Elastic Stack

Elastic Cloud

Elasticsearch DSL (Query DSL)

Nodes in Elasticsearch

Elasticsearch and relational database

Elasticsearch ecosystem

Elasticsearch deployment options

Clients for Elasticsearch

Elasticsearch for fast streaming layer

Elasticsearch as a data source

Elasticsearch for content indexing

Elasticsearch and Hadoop

Elasticsearch working example

Indexing Documents

Getting Indexed Document

Searching Documents

Updating Documents

Deleting a document

Elasticsearch in purview of SCV use case

Data Lake Components Working Together

Where we stand with Data Lake

Core architecture principles of Data Lake

Challenges faced by enterprise Data Lake

Expectations from Data Lake

Data Lake for other activities

Knowing more about data storage

Knowing more about Data processing

Thoughts on data security

Thoughts on data encryption

Metadata management and governance

Thoughts on Data Auditing

Thoughts on data traceability

Knowing more about Serving Layer

Summary

Data Lake Use Case Suggestions

Establishing cybersecurity practices in an enterprise

Know the customers dealing with your enterprise

Bring efficiency in warehouse management

Developing a brand and marketing of the enterprise

Achieve a higher degree of personalization with customers

Bringing IoT data analysis at your fingertips

More practical and useful data archival

Complement the existing data warehouse infrastructure

Achieving telecom security and regulatory compliance

Summary

Knowing Hadoop distributions

A Big Data ecosystem consists of multiple capabilities, and for every capability in the ecosystem there are one or more frameworks. Each distribution realizes these capabilities in its own way and offers some edge of its own over competitors in the same space.

Figure 01: Hadoop distributions

The figure shows some of the leading distributions of the Hadoop framework. Cloudera, Hortonworks, and MapR are the leaders in the commercial space, while Apache Hadoop is the open source distribution. The commercial offerings, while having their own specific capabilities, are largely based on the specifications of the open source Hadoop framework.

Unfortunately, there is no straightforward answer to the question of which Hadoop distribution should be chosen. However, we can compare these distributions across the various dimensions that we may be interested in for evaluation.
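
One way to make such a comparison concrete is a simple weighted scoring matrix. The following is only a minimal sketch under stated assumptions: the dimension names, the weights, the candidate names (`distribution_a`, `distribution_b`), and all scores are hypothetical placeholders, not assessments of any real distribution and not a method taken from the book.

```python
# Hypothetical weights: how much each evaluation dimension matters to us.
# The dimension names are illustrative assumptions; the weights sum to 1.0.
WEIGHTS = {"cost": 0.40, "support": 0.35, "ease_of_setup": 0.25}

# Placeholder scores on a 1-5 scale for two made-up candidates.
# These are NOT measurements of any real Hadoop distribution.
SCORES = {
    "distribution_a": {"cost": 5, "support": 2, "ease_of_setup": 3},
    "distribution_b": {"cost": 2, "support": 5, "ease_of_setup": 4},
}


def weighted_score(scores, weights):
    """Return the sum of score * weight over every weighted dimension."""
    return sum(weights[dim] * scores[dim] for dim in weights)


def rank(candidates, weights):
    """Return candidate names ordered from highest to lowest weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank(SCORES, WEIGHTS):
        print(name, round(weighted_score(SCORES[name], WEIGHTS), 2))
```

Changing the weights changes the ranking, which reflects the point made above: the choice depends on which dimensions matter most to a given enterprise rather than on a single universal answer.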
