Data Lake for Enterprises

By: Vivek Mishra, Tomcy John, Pankaj Misra

Overview of this book

The term "Data Lake" has recently become prominent in the big data industry. Data scientists can use a data lake to derive meaningful insights that businesses can use to redefine or transform the way they operate. Lambda architecture is also emerging as one of the eminent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable businesses to make critical decisions. This book brings these two important aspects, the data lake and the Lambda architecture, together.

The book is divided into three main sections. The first introduces the concept of data lakes, explains their importance in enterprises, and gets you up to speed with the Lambda architecture. The second delves into the principal components of building a data lake using the Lambda architecture, and introduces popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and Elasticsearch. The third is a highly practical demonstration of putting it all together: it shows how an enterprise data lake can be implemented, along with several real-world use cases, and how other peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies, using the Lambda architectural patterns, to build your enterprise data lake.

Title Page

Credits

Foreword

About the Authors

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

Part 1 - Overview

Part 2 - Technical Building blocks of Data Lake

Part 3 - Bringing It All Together

Introduction to Data

Exploring data

What is Enterprise Data?

Enterprise Data Management

Big data concepts

Relevance of data

Quality of data

Where does this data live in an enterprise?

Enterprise’s current state

Enterprise digital transformation

Data lake use case enlightenment

Summary

Comprehensive Concepts of a Data Lake

What is a Data Lake?

How does a Data Lake help enterprises?

How Data Lake works?

Differences between Data Lake and Data Warehouse

Approaches to building a Data Lake

Lambda Architecture-driven Data Lake

Summary

Lambda Architecture as a Pattern for Data Lake

What is Lambda Architecture?

History of Lambda Architecture

Principles of Lambda Architecture

Components of a Lambda Architecture

Complete working of a Lambda Architecture

Advantages of Lambda Architecture

Disadvantages of Lambda Architectures

Technology overview for Lambda Architecture

Applied lambda

Working examples of Lambda Architecture

Kappa architecture

Summary

Applied Lambda for Data Lake

Knowing Hadoop distributions

Selection factors for a big data stack for enterprises

Batch layer for data processing

Serving layer

Summary

Data Acquisition of Batch Data using Apache Sqoop

Context in data lake - data acquisition

Why Apache Sqoop

Workings of Sqoop

Sqoop connectors

Sqoop support for HDFS

Sqoop working example

When to use Sqoop

When not to use Sqoop

Real-time Sqooping: a possibility?

Other options

Summary

Data Acquisition of Stream Data using Apache Flume

Context in Data Lake: data acquisition

Why Flume?

Flume architecture principles

The Flume Architecture

Flume event - Stream Data

Flume transaction management

Other flume components

Context Routing

Flume working example

When to use Flume

When not to use Flume

Other options

Summary

Messaging Layer using Apache Kafka

Context in Data Lake - messaging layer

Why Apache Kafka

Kafka architecture

Other Kafka components

Kafka programming interface

Producer and consumer reliability

Kafka security

Kafka as message-oriented middleware

Scale-out architecture with Kafka

Kafka connect

Kafka working example

When to use Kafka

When not to use Kafka

Other options

Summary

Data Processing using Apache Flink

Context in a Data Lake - Data Ingestion Layer

Why Apache Flink?

Working of Flink

Flink APIs

Flink working example

When to use Flink

When not to use Flink

Other options

Summary

Data Store Using Apache Hadoop

Context for Data Lake - Data Storage and lambda Batch layer

Hadoop for near real-time applications

Hadoop deployment modes

Hadoop working examples

When not to use Hadoop

Other Hadoop Processing Options

Summary

Indexed Data Store using Elasticsearch

Context in Data Lake: data storage and lambda speed layer

What is Elasticsearch?

Why Elasticsearch

Working of Elasticsearch

Elastic Stack

Elastic Cloud

Elasticsearch DSL (Query DSL)

Nodes in Elasticsearch

Elasticsearch and relational database

Elasticsearch ecosystem

Elasticsearch deployment options

Clients for Elasticsearch

Elasticsearch for fast streaming layer

Elasticsearch as a data source

Elasticsearch for content indexing

Elasticsearch and Hadoop

Elasticsearch working example

Indexing Documents

Getting Indexed Document

Searching Documents

Updating Documents

Deleting a document

Elasticsearch in purview of SCV use case

Data Lake Components Working Together

Where we stand with Data Lake

Core architecture principles of Data Lake

Challenges faced by enterprise Data Lake

Expectations from Data Lake

Data Lake for other activities

Knowing more about data storage

Knowing more about Data processing

Thoughts on data security

Thoughts on data encryption

Metadata management and governance

Thoughts on Data Auditing

Thoughts on data traceability

Knowing more about Serving Layer

Summary

Data Lake Use Case Suggestions

Establishing cybersecurity practices in an enterprise

Know the customers dealing with your enterprise

Bring efficiency in warehouse management

Developing a brand and marketing of the enterprise

Achieve a higher degree of personalization with customers

Bringing IoT data analysis at your fingertips

More practical and useful data archival

Complement the existing data warehouse infrastructure

Achieving telecom security and regulatory compliance

Summary

Flume sink

Similar to the source, the sink is managed by SinkRunner, which manages the thread and execution model. Unlike a source, however, a sink is polling-based: it polls the channel for events. The sink is the component that outputs each event, according to the type of output required, from the agent to an external system or to another source. Sinks also participate in transaction management: when the output from a sink is successful, an acknowledgement is passed back to the channel, and the channel then removes the event from its persistence mechanism. Transaction management will be covered in detail in a separate section.
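The poll, deliver, and acknowledge cycle described above can be sketched in a few lines of Python. This is only an illustrative toy model, not Flume's actual Java API: the names `Channel`, `run_sink`, and `deliver` are hypothetical, and the single in-memory queue stands in for whatever persistence mechanism a real channel uses.

```python
from collections import deque


class Channel:
    """Toy channel: an event stays stored until its take is committed."""

    def __init__(self, events):
        self._events = deque(events)
        self._in_flight = None

    def take(self):
        # Hand out the next event without discarding it yet.
        if self._in_flight is None and self._events:
            self._in_flight = self._events[0]
        return self._in_flight

    def commit(self):
        # Acknowledgement from the sink: the event can now be removed.
        if self._in_flight is not None:
            self._events.popleft()
            self._in_flight = None

    def rollback(self):
        # Delivery failed: keep the event so a later poll can retry it.
        self._in_flight = None

    def __len__(self):
        return len(self._events)


def run_sink(channel, deliver, max_polls):
    """Poll the channel up to max_polls times, one event per poll."""
    delivered = 0
    for _ in range(max_polls):
        event = channel.take()
        if event is None:
            break  # channel is empty; a real runner would back off and poll again
        try:
            deliver(event)
        except OSError:
            channel.rollback()  # no acknowledgement, so the event is retained
            continue
        channel.commit()  # successful output acknowledged to the channel
        delivered += 1
    return delivered


# Demo: the second delivery attempt fails once and is retried on the next poll.
written = []
attempts = {"count": 0}


def flaky_deliver(event):
    attempts["count"] += 1
    if attempts["count"] == 2:
        raise OSError("simulated output failure")
    written.append(event)


demo_channel = Channel(["e1", "e2", "e3"])
delivered_count = run_sink(demo_channel, flaky_deliver, max_polls=10)
```

In this sketch an event leaves the channel only after `commit`, so a failed output never loses data; it is simply offered again on the next poll.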

There are a variety of existing sinks available, as follows:

HDFS: Writes events to HDFS. This sink currently supports writing text and sequence files, in compressed format as well. The following is a sample HDFS sink configuration for an agent named a1, taken from the Flume user guide (https://flume.apache.org), where the full configuration can be found:

a1.channels = c1
a1.sinks = k1...
