Data Store Using Apache Hadoop | Data Lake for Enterprises

Data Lake for Enterprises

By: Mishra, John, Pankaj Misra

Overview of this book

The term "data lake" has recently become prominent in the big data industry. Data scientists can use a data lake to derive meaningful insights that businesses can use to redefine or transform the way they operate. Lambda architecture is also emerging as a prominent pattern in the big data landscape, because it not only helps derive useful information from historical data but also correlates it with real-time data so that businesses can make critical decisions. This book brings these two important topics, data lakes and Lambda architecture, together. The book is divided into three main sections. The first introduces the concept of data lakes, explains their importance in enterprises, and gets you up to speed with the Lambda architecture. The second delves into the principal components of building a data lake using the Lambda architecture and introduces popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and Elasticsearch. The third is a highly practical demonstration of putting it all together: it shows how an enterprise data lake can be implemented, along with several real-world use cases, and how other peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies, using Lambda architectural patterns, to build your enterprise data lake.

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Introduction to Data

Exploring data

What is Enterprise Data?

Enterprise Data Management

Big data concepts

Relevance of data

Quality of data

Where does this data live in an enterprise?

Enterprise's current state

Enterprise digital transformation

Data lake use case enlightenment

Summary

Comprehensive Concepts of a Data Lake

What is a Data Lake?

How does a Data Lake help enterprises?

How does a Data Lake work?

Differences between Data Lake and Data Warehouse

Approaches to building a Data Lake

Lambda Architecture-driven Data Lake

Summary

Lambda Architecture as a Pattern for Data Lake

What is Lambda Architecture?

History of Lambda Architecture

Principles of Lambda Architecture

Components of a Lambda Architecture

Complete working of a Lambda Architecture

Advantages of Lambda Architecture

Disadvantages of Lambda Architectures

Technology overview for Lambda Architecture

Applied lambda

Working examples of Lambda Architecture

Kappa architecture

Summary

Applied Lambda for Data Lake

Knowing Hadoop distributions

Selection factors for a big data stack for enterprises

Batch layer for data processing

Serving layer

Summary

Data Acquisition of Batch Data using Apache Sqoop

Context in data lake - data acquisition

Why Apache Sqoop

Workings of Sqoop

Sqoop connectors

Sqoop support for HDFS

Sqoop working example

When to use Sqoop

When not to use Sqoop

Real-time Sqooping: a possibility?

Other options

Summary

Data Acquisition of Stream Data using Apache Flume

Context in Data Lake: data acquisition

Why Flume?

Flume architecture principles

The Flume Architecture

Flume event - Stream Data

Flume agent

Flume source

Flume Channel

Flume sink

Flume configuration

Flume transaction management

Other Flume components

Context Routing

Flume working example

When to use Flume

When not to use Flume

Other options

Summary

Messaging Layer using Apache Kafka

Context in Data Lake: messaging layer

Why Apache Kafka

Kafka architecture

Other Kafka components

Kafka programming interface

Producer and consumer reliability

Kafka security

Kafka as message-oriented middleware

Scale-out architecture with Kafka

Kafka Connect

Kafka working example

When to use Kafka

When not to use Kafka

Other options

Summary

Data Processing using Apache Flink

Context in a Data Lake - Data Ingestion Layer

Why Apache Flink?

Working of Flink

Flink APIs

Flink working example

When to use Flink

When not to use Flink

Other options

Summary

Data Store Using Apache Hadoop

Context for Data Lake - Data Storage and lambda Batch layer

Why Hadoop?

Working of Hadoop

Hadoop ecosystem

Hadoop distributions

HDFS and formats

Hadoop for near real-time applications

Hadoop deployment modes

Hadoop working examples

When not to use Hadoop

Other Hadoop Processing Options

Summary

Indexed Data Store using Elasticsearch

Context in Data Lake: data storage and lambda speed layer

What is Elasticsearch?

Why Elasticsearch

Working of Elasticsearch

Elastic Stack

Elastic Cloud

Elasticsearch DSL (Query DSL)

Nodes in Elasticsearch

Elasticsearch and relational database

Elasticsearch ecosystem

Elasticsearch deployment options

Clients for Elasticsearch

Elasticsearch for fast streaming layer

Elasticsearch as a data source

Elasticsearch for content indexing

Elasticsearch and Hadoop

Elasticsearch working example

When to use Elasticsearch

When not to use Elasticsearch

Other options

Summary

Data Lake Components Working Together

Where we stand with Data Lake

Core architecture principles of Data Lake

Challenges faced by enterprise Data Lake

Expectations from Data Lake

Data Lake for other activities

Knowing more about data storage

Knowing more about Data processing

Thoughts on data security

Thoughts on data encryption

Metadata management and governance

Thoughts on Data Auditing

Thoughts on data traceability

Knowing more about Serving Layer

Summary

Data Lake Use Case Suggestions

Establishing cybersecurity practices in an enterprise

Know the customers dealing with your enterprise

Bring efficiency in warehouse management

Developing a brand and marketing of the enterprise

Achieve a higher degree of personalization with customers

Bringing IoT data analysis at your fingertips

More practical and useful data archival

Complement the existing data warehouse infrastructure

Achieving telecom security and regulatory compliance

Summary

Working of Hadoop

Hadoop core architecture principles