Data Lake for Enterprises

By: Vivek Mishra, Tomcy John, Pankaj Misra

Overview of this book

The term "Data Lake" has recently become prominent in the big data industry. Data scientists can use a data lake to derive meaningful insights that businesses can use to redefine or transform the way they operate. Lambda architecture is also emerging as one of the most prominent patterns in the big data landscape, as it not only helps derive useful information from historical data but also correlates real-time data to enable businesses to make critical decisions. This book brings these two important aspects, data lake and Lambda architecture, together. The book is divided into three main sections. The first introduces you to the concept of data lakes and their importance in enterprises, and gets you up to speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and Elasticsearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use cases. It also shows you how other peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies using the Lambda architectural patterns to build your enterprise data lake.

Flume event - Stream Data

An event is the unit of data that is sent across the Flume pipeline. The structure of an event is quite simple and has two parts:

Event header: A set of key/value pairs in the form Map<String, String>. These headers carry additional data about the event. For example, they can hold the severity and priority of the event. They can also contain a UUID or event ID that distinguishes one event from another.
Event payload: An array of bytes in the form byte[]. The default body size is 32 KB, and a larger body is usually truncated at that limit, but this value is configurable in Flume.
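
The two-part structure described above can be sketched in plain Java. This is only an illustration, not Flume's own implementation: the class name `SimpleEvent`, its factory method `of`, and the constant `DEFAULT_MAX_BODY_BYTES` are hypothetical names introduced here, and the truncation logic simply mirrors the 32 KB figure quoted in the text. In Flume itself, an event is represented by the `org.apache.flume.Event` interface, which exposes the headers as a `Map<String, String>` and the body as a `byte[]`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Flume event: string headers plus an opaque byte payload.
final class SimpleEvent {
    // Assumption: limit taken from the 32 KB default quoted in the text.
    static final int DEFAULT_MAX_BODY_BYTES = 32 * 1024;

    private final Map<String, String> headers;
    private final byte[] body;

    private SimpleEvent(Map<String, String> headers, byte[] body) {
        // Defensive copy so later changes to the caller's map do not leak into the event.
        this.headers = Collections.unmodifiableMap(new HashMap<>(headers));
        this.body = body;
    }

    // Builds an event, truncating the payload to maxBodyBytes if it is longer.
    static SimpleEvent of(Map<String, String> headers, byte[] payload, int maxBodyBytes) {
        int length = Math.min(payload.length, maxBodyBytes);
        return new SimpleEvent(headers, Arrays.copyOf(payload, length));
    }

    // Builds an event using the assumed default body limit.
    static SimpleEvent of(Map<String, String> headers, byte[] payload) {
        return of(headers, payload, DEFAULT_MAX_BODY_BYTES);
    }

    Map<String, String> headers() {
        return headers;
    }

    byte[] body() {
        return body.clone();
    }
}
```

As a usage example, a caller might put `severity` and `eventId` entries into a `HashMap`, then call `SimpleEvent.of(headers, "disk usage high".getBytes(StandardCharsets.UTF_8))`. Keeping the payload as raw bytes means the event carries no assumption about the content format, while the headers stay available for routing and filtering decisions.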

This figure shows the internal structure of the Flume event, which hops from one agent to another in Flume:

Figure 13: Anatomy of a Flume event