Data Lake for Enterprises

By: Mishra, John, Pankaj Misra

Overview of this book

The term "Data Lake" has recently become prominent in the big data industry. Data scientists can use a data lake to derive meaningful insights that businesses can use to redefine or transform the way they operate. Lambda architecture is also emerging as one of the prominent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable businesses to make critical decisions. This book brings these two important aspects, data lake and Lambda architecture, together.

This book is divided into three main sections. The first introduces the concept of data lakes and their importance in enterprises, and gets you up to speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and Elasticsearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use cases. It also shows you how other peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies using Lambda architectural patterns to build your enterprise data lake.

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Introduction to Data

Exploring data

What is Enterprise Data?

Enterprise Data Management

Big data concepts

Relevance of data

Quality of data

Where does this data live in an enterprise?

Enterprise's current state

Enterprise digital transformation

Data lake use case enlightenment

Summary

Comprehensive Concepts of a Data Lake

What is a Data Lake?

How does a Data Lake help enterprises?

How Data Lake works?

Differences between Data Lake and Data Warehouse

Approaches to building a Data Lake

Lambda Architecture-driven Data Lake

Summary

Lambda Architecture as a Pattern for Data Lake

What is Lambda Architecture?

History of Lambda Architecture

Principles of Lambda Architecture

Components of a Lambda Architecture

Complete working of a Lambda Architecture

Advantages of Lambda Architecture

Disadvantages of Lambda Architectures

Technology overview for Lambda Architecture

Applied lambda

Working examples of Lambda Architecture

Kappa architecture

Summary

Applied Lambda for Data Lake

Knowing Hadoop distributions

Selection factors for a big data stack for enterprises

Batch layer for data processing

Serving layer

Summary

Data Acquisition of Batch Data using Apache Sqoop

Context in data lake - data acquisition

Why Apache Sqoop

Workings of Sqoop

Sqoop connectors

Sqoop support for HDFS

Sqoop working example

When to use Sqoop

When not to use Sqoop

Real-time Sqooping: a possibility?

Other options

Summary

Data Acquisition of Stream Data using Apache Flume

Context in Data Lake: data acquisition

Why Flume?

Flume architecture principles

The Flume Architecture

Flume event - Stream Data

Flume agent

Flume source

Flume Channel

Flume sink

Flume configuration

Flume transaction management

Other flume components

Context Routing

Flume working example

When to use Flume

When not to use Flume

Other options

Summary

Messaging Layer using Apache Kafka

Context in Data Lake- messaging layer

Why Apache Kafka

Kafka architecture

Other Kafka components

Kafka programming interface

Producer and consumer reliability

Kafka security

Kafka as message-oriented middleware

Scale-out architecture with Kafka

Kafka connect

Kafka working example

When to use Kafka

When not to use Kafka

Other options

Summary

Data Processing using Apache Flink

Context in a Data Lake - Data Ingestion Layer

Why Apache Flink?

Working of Flink

Flink API's

Flink working example

When to use Flink

When not to use Flink

Other options

Summary

Data Store Using Apache Hadoop

Context for Data Lake - Data Storage and lambda Batch layer

Why Hadoop?

Working of Hadoop

Hadoop ecosystem

Hadoop distributions

HDFS and formats

Hadoop for near real-time applications

Hadoop deployment modes

Hadoop working examples

When not to use Hadoop

Other Hadoop Processing Options

Summary

Indexed Data Store using Elasticsearch

Context in Data Lake: data storage and lambda speed layer

What is Elasticsearch?

Why Elasticsearch

Working of Elasticsearch

Elastic Stack

Elastic Cloud

Elasticsearch DSL (Query DSL)

Nodes in Elasticsearch

Elasticsearch and relational database

Elasticsearch ecosystem

Elasticsearch deployment options

Clients for Elasticsearch

Elasticsearch for fast streaming layer

Elasticsearch as a data source

Elasticsearch for content indexing

Elasticsearch and Hadoop

Elasticsearch working example

When to use Elasticsearch

When not to use Elasticsearch

Other options

Summary

Data Lake Components Working Together

Where we stand with Data Lake

Core architecture principles of Data Lake

Challenges faced by enterprise Data Lake

Expectations from Data Lake

Data Lake for other activities

Knowing more about data storage

Knowing more about Data processing

Thoughts on data security

Thoughts on data encryption

Metadata management and governance

Thoughts on Data Auditing

Thoughts on data traceability

Knowing more about Serving Layer

Summary

Data Lake Use Case Suggestions

Establishing cybersecurity practices in an enterprise

Know the customers dealing with your enterprise

Bring efficiency in warehouse management

Developing a brand and marketing of the enterprise

Achieve a higher degree of personalization with customers

Bringing IoT data analysis at your fingertips

More practical and useful data archival

Compliment the existing data warehouse infrastructure

Achieving telecom security and regulatory compliance

Summary

Indexed Data Store using Elasticsearch

In the previous chapter on Hadoop, we persisted the data in hand onto Hadoop (HDFS). Querying data from Hadoop at a fast pace is difficult, and that is where an indexed data store such as Elasticsearch becomes significant in our Data Lake implementation.
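To make the contrast concrete, here is a minimal sketch of why an indexed store can answer queries faster than scanning raw data. It is not code from the book and not how Elasticsearch is implemented internally; it only illustrates the general idea of an inverted index, the structure that search engines of this kind build on. The sample records and the function names (`scan_search`, `build_index`, `index_search`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records standing in for data persisted in the lake.
DOCS = {
    1: "customer placed order in london",
    2: "customer returned order",
    3: "new customer signed up in paris",
}


def scan_search(docs, term):
    """Full scan: every document is read again on every query."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())


def build_index(docs):
    """Build an inverted index once: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return index


def index_search(index, term):
    """Indexed lookup: one dictionary access per query, no document reads."""
    return sorted(index.get(term, set()))


INDEX = build_index(DOCS)
print(scan_search(DOCS, "order"))
print(index_search(INDEX, "order"))
```

Both calls return the same document ids, but the scan cost grows with the total volume of data on every query, whereas the index pays that cost once at write time. That trade-off is, roughly, what placing an indexed store next to HDFS buys.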

As in other chapters in this part of the book, we will start off the chapter by explaining the layer where this technology will be used. We will then explain the reason for choosing this technology for this capability before diving deep into Elasticsearch and how it works. We will cover enough detail for you to understand this technology; as always, a full deep dive is beyond the scope of this book.
We will then take you through a hands-on coding session, where you will first learn to install this technology and then...
