
Data Lake for Enterprises

By: Vivek Mishra, Tomcy John, Pankaj Misra

Overview of this book

The term "Data Lake" has recently emerged as a prominent term in the big data industry. Data scientists can use a data lake to derive meaningful insights that businesses can apply to redefine or transform the way they operate. The Lambda architecture is likewise emerging as a prominent pattern in the big data landscape: it not only helps derive useful information from historical data but also correlates real-time data so that businesses can make critical decisions. This book brings these two important aspects, the data lake and the Lambda architecture, together. The book is divided into three main sections. The first introduces you to the concept of data lakes and their importance in enterprises, and brings you up to speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture, introducing popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and Elasticsearch. The third section is a highly practical demonstration of putting it all together: it shows how an enterprise data lake can be implemented, along with several real-world use cases, and how peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies and apply Lambda architectural patterns to build your enterprise data lake.
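As a rough illustration of the Lambda architecture the book is built around, the following plain-Python sketch (not code from the book, and not tied to any framework) shows the three layers: a batch layer that recomputes views over all historical data, a speed layer that keeps an incremental view over recent data, and a serving layer that merges the two to answer queries.

```python
# Conceptual sketch of the Lambda architecture's three layers in plain
# Python: batch layer (recompute from the master dataset), speed layer
# (incremental real-time view), serving layer (merge both for queries).
from collections import Counter

historical_events = ["login", "purchase", "login", "logout"]
recent_events = ["login", "purchase"]  # arrived after the last batch run

def batch_layer(events):
    """Recompute the full view from the master dataset."""
    return Counter(events)

def speed_layer(events):
    """Maintain an incremental view over data not yet batch-processed."""
    return Counter(events)

def serving_layer(batch_view, realtime_view):
    """Answer queries by merging the batch and real-time views."""
    return batch_view + realtime_view

view = serving_layer(batch_layer(historical_events), speed_layer(recent_events))
print(view["login"])  # 3: two historical logins plus one recent one
```

The point of the pattern is that the batch layer's periodic recomputation corrects any approximation in the speed layer, while the speed layer keeps query results current between batch runs.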
Table of Contents (23 chapters)
Title Page
About the Authors
About the Reviewers
Customer Feedback
Part 1 - Overview
Part 2 - Technical Building Blocks of Data Lake
Part 3 - Bringing It All Together

Part 2 - Technical Building Blocks of Data Lake

This part of the book introduces the reader to the technologies that make up the Data Lake implementation. Each chapter covers one technology and incrementally builds up both the Data Lake and the running use case, the Single Customer View (SCV). The important technical details of each technology are covered in a holistic fashion; in-depth coverage is out of scope for this book. This part consists of six chapters, each with a well-defined goal, as detailed below.

Chapter 5, Data Acquisition of Batch Data using Apache Sqoop, delves deep into Apache Sqoop. It explains the reasons for this choice and also presents the other technology options in good detail. The chapter includes a detailed example connecting the Data Lake and the Lambda Architecture. The reader will come to understand the Sqoop framework and similar tools for loading data from an enterprise data source into a Data Lake, the technical details of Sqoop, and the architectural problems it solves. The reader will also be taken through examples in which Sqoop is seen in action, along with the various steps involved in using it with Hadoop technologies.
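To give a flavor of what Sqoop does, the sketch below simulates, in plain Python (this is not the Sqoop API), the idea behind Sqoop's real `--incremental append --check-column --last-value` import flags: each run pulls only the source rows whose check column exceeds the recorded checkpoint.

```python
# Illustrative sketch (not Sqoop itself) of an incremental append import:
# only rows whose check column ("id") exceeds the saved last value are
# pulled from the source table into the lake on each run.
rows = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "carol"},
]

def incremental_import(table, last_value):
    """Return the new rows and the updated checkpoint."""
    new_rows = [r for r in table if r["id"] > last_value]
    new_last = max((r["id"] for r in new_rows), default=last_value)
    return new_rows, new_last

first_batch, checkpoint = incremental_import(rows, last_value=0)   # all 3 rows
rows.append({"id": 4, "name": "dave"})                             # new source row
second_batch, checkpoint = incremental_import(rows, checkpoint)    # only id 4
print(len(first_batch), len(second_batch))  # 3 1
```

In real Sqoop the checkpoint is stored in the saved-job metastore, and the rows land in HDFS or Hive rather than a Python list.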

Chapter 6, Data Acquisition of Stream Data using Apache Flume, delves deep into Apache Flume, connecting it to the Data Lake and the Lambda Architecture. The reader will understand Flume as a framework and the various patterns by which it can be leveraged for a Data Lake. The reader will also understand Flume's architecture and the technical details of acquiring and consuming data with it, including its specific capabilities around transaction control and data replay, with working examples. The reader will also learn how to use Flume with streaming technologies for stream-based processing.
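Flume's core anatomy is a source writing events into a channel, from which a sink takes them transactionally. The plain-Python sketch below (not the Flume API) mimics that flow: events are only removed from the channel once the sink commits, which is what enables replay after a failed delivery.

```python
# Minimal sketch of a Flume-style agent (source -> channel -> sink).
# The channel decouples producer from consumer; events leave the channel
# only on commit, so an uncommitted batch remains available for replay.
from collections import deque

class Channel:
    def __init__(self):
        self.queue = deque()

    def put(self, event):
        self.queue.append(event)

    def take(self, n):
        # Read without removing: commit() finalizes; rollback is a no-op.
        return [self.queue[i] for i in range(min(n, len(self.queue)))]

    def commit(self, n):
        for _ in range(min(n, len(self.queue))):
            self.queue.popleft()

channel = Channel()
for line in ["evt1", "evt2", "evt3"]:   # a source appending events
    channel.put(line)

batch = channel.take(2)                 # sink begins a transaction
channel.commit(len(batch))              # delivery succeeded
print(batch, len(channel.queue))        # ['evt1', 'evt2'] 1
```

Flume's durable file channel applies the same contract to disk-backed storage, so events survive agent restarts between take and commit.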

Chapter 7, Messaging Layer using Apache Kafka, delves deep into Apache Kafka. The chapter first gives the reader the reasons for choosing this particular technology and details the other technology options. The reader will understand Kafka as message-oriented middleware and how it compares with other messaging engines. The reader will get to know how Kafka functions and how it can be leveraged to build scale-out capabilities, from the perspective of the client (publisher), the broker, and the consumer (subscriber). The reader will also understand how to integrate Kafka with Hadoop components to acquire enterprise data, and what capabilities this integration brings to the Data Lake.
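The scale-out capability mentioned above rests on Kafka's core model: a topic is a set of partitioned, append-only logs, the producer routes each record to a partition by key, and each consumer tracks its own offset per partition. The toy sketch below (plain Python, not the Kafka client API) shows that model.

```python
# Toy sketch of Kafka's topic/partition/offset model (not the Kafka API):
# records with the same key land in the same partition, preserving order
# per key, and consumers advance an offset rather than deleting messages.

class Topic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        p = hash(key) % len(self.partitions)   # same key -> same partition
        self.partitions[p].append(value)
        return p

    def consume(self, partition, offset):
        """Read everything after `offset`; the consumer commits the new offset."""
        log = self.partitions[partition]
        return log[offset:], len(log)

topic = Topic(num_partitions=2)
p = topic.produce("customer-42", "order-created")
topic.produce("customer-42", "order-shipped")   # same key, same partition

records, offset = topic.consume(p, offset=0)
print(records, offset)   # ['order-created', 'order-shipped'] 2
```

Because messages are retained and consumers merely move offsets, many independent consumer groups can read the same topic, which is what lets Kafka feed both the batch and speed layers of a Lambda architecture.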

Chapter 8, Data Processing using Apache Flink, introduces the concepts around streaming and stream-based processing, specifically with reference to Apache Flink. The reader will go deep into using Apache Flink in the context of the Data Lake and the big data technology landscape for near-real-time processing of data, with working examples. The reader will also see how streaming functionality depends on the various other layers of the architecture, and how those layers can influence streaming capability.
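A staple of the stream processing Flink expresses through its DataStream API is windowed aggregation. The sketch below (plain Python, not the Flink API) shows the idea of a tumbling window: each event carries a timestamp, and counts are aggregated per fixed-size window.

```python
# Conceptual sketch of tumbling-window aggregation over a stream
# (plain Python, not Flink): events are assigned to fixed-size,
# non-overlapping windows by timestamp, and counted per key.
from collections import defaultdict

events = [  # (timestamp_seconds, user)
    (1, "alice"), (4, "bob"), (7, "alice"),
    (12, "alice"), (15, "bob"),
]

def tumbling_window_counts(stream, window_size):
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in stream:
        window_start = (ts // window_size) * window_size
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

result = tumbling_window_counts(events, window_size=10)
print(result)
# {0: {'alice': 2, 'bob': 1}, 10: {'alice': 1, 'bob': 1}}
```

A real engine such as Flink does this continuously and in parallel, and additionally has to handle late and out-of-order events via watermarks, which this offline sketch sidesteps by seeing the whole stream up front.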

Chapter 9, Data Storage using Apache Hadoop, delves deep into Apache Hadoop. In this chapter, the reader goes deeper into the Hadoop landscape: the various Hadoop components, how they function, and the specific capabilities they provide for an enterprise Data Lake. Hadoop is explained at an implementation level in the context of the Data Lake, showing how the framework's capabilities around file storage, file formats, and MapReduce can constitute the foundation of a Data Lake, and which patterns can be applied to this stack for near-real-time capabilities.
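The MapReduce model at Hadoop's core can be sketched in a few lines of plain Python (this is not the Hadoop API): map emits key/value pairs, the shuffle groups them by key, and reduce aggregates each group. The classic word count makes the flow concrete.

```python
# Sketch of the map -> shuffle -> reduce flow behind Hadoop MapReduce,
# in plain Python: Hadoop runs the same three phases in parallel across
# HDFS blocks, with the shuffle moving grouped pairs between nodes.
from itertools import groupby
from operator import itemgetter

lines = ["data lake for enterprises", "data lake"]

def map_phase(line):
    for word in line.split():
        yield (word, 1)                      # emit (key, value) pairs

mapped = [pair for line in lines for pair in map_phase(line)]
# Shuffle: sort and group all pairs by key.
shuffled = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))
# Reduce: aggregate the values for each key.
reduced = {word: sum(count for _, count in pairs) for word, pairs in shuffled}
print(reduced["data"], reduced["lake"], reduced["for"])  # 2 2 1
```

The design insight is that because map and reduce are pure per-key functions, the framework can parallelize and re-run them freely, which is what makes the model fault-tolerant at scale.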

Chapter 10, Indexed Data Store using Elasticsearch, delves deep into Elasticsearch. The reader will understand Elasticsearch as a data indexing framework and the various analyzers the framework provides for efficient searches. The reader will also understand how Elasticsearch can be leveraged for a Data Lake and for data at scale, with efficient sharding and distribution mechanisms for consistent performance. Finally, the reader will understand how Elasticsearch can be used for fast streaming and for high-performance applications, with working examples.
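The two ideas named above, analyzers and fast search, come together in the inverted index. The toy sketch below (plain Python, not the Elasticsearch API) shows how an analyzer normalizes text into terms and how the inverted index maps each term to the documents containing it, making lookups independent of document length.

```python
# Toy sketch of an analyzer plus inverted index, the mechanism behind
# Elasticsearch searches (plain Python, not the Elasticsearch API).
from collections import defaultdict

def analyze(text):
    """A minimal stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()

docs = {
    1: "Data Lake for Enterprises",
    2: "Enterprise data warehousing",
}

# Build the inverted index: term -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term].add(doc_id)

def search(query):
    # Queries pass through the same analyzer as the indexed documents.
    return sorted(index.get(analyze(query)[0], set()))

print(search("DATA"))   # [1, 2] -- analysis makes the search case-insensitive
```

Real Elasticsearch analyzers add tokenization rules, stemming, and stop-word handling, and the index is split into shards distributed across nodes, but the term-to-documents structure is the same.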