Data Lake for Enterprises

Data Lake for Enterprises

By : Vivek Mishra, Tomcy John, Pankaj Misra

Buy this Book

Data Lake for Enterprises

By: Vivek Mishra, Tomcy John, Pankaj Misra

Buy this Book

Overview of this book

The term "Data Lake" has recently emerged as a prominent term in the big data industry. Data scientists can make use of it in deriving meaningful insights that can be used by businesses to redefine or transform the way they operate. Lambda architecture is also emerging as one of the very eminent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable business to take critical decisions. This book tries to bring these two important aspects — data lake and lambda architecture—together. This book is divided into three main sections. The first introduces you to the concept of data lakes, the importance of data lakes in enterprises, and getting you up-to-speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and ElasticSearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use-cases. It also shows you how other peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies using the lambda architectural patterns to build your enterprise data lake.

Title Page

Credits

Foreword

About the Authors

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

Part 1 - Overview

Part 2 - Technical Building blocks of Data Lake

Part 3 - Bringing It All Together

Free Chapter

Introduction to Data

Exploring data

What is Enterprise Data?

Enterprise Data Management

Big data concepts

Relevance of data

Quality of data

Where does this data live in an enterprise?

Enterprise’s current state

Enterprise digital transformation

Data lake use case enlightenment

Summary

Comprehensive Concepts of a Data Lake

What is a Data Lake?

How does a Data Lake help enterprises?

How Data Lake works?

Differences between Data Lake and Data Warehouse

Approaches to building a Data Lake

Lambda Architecture-driven Data Lake

Summary

Lambda Architecture as a Pattern for Data Lake

What is Lambda Architecture?

History of Lambda Architecture

Principles of Lambda Architecture

Components of a Lambda Architecture

Complete working of a Lambda Architecture

Advantages of Lambda Architecture

Disadvantages of Lambda Architectures

Technology overview for Lambda Architecture

Applied lambda

Working examples of Lambda Architecture

Kappa architecture

Summary

Applied Lambda for Data Lake

Knowing Hadoop distributions

Selection factors for a big data stack for enterprises

Batch layer for data processing

Serving layer

Summary

Data Acquisition of Batch Data using Apache Sqoop

Context in data lake - data acquisition

Why Apache Sqoop

Workings of Sqoop

Sqoop connectors

Sqoop support for HDFS

Sqoop working example

When to use Sqoop

When not to use Sqoop

Real-time Sqooping: a possibility?

Other options

Summary

Data Acquisition of Stream Data using Apache Flume

Context in Data Lake: data acquisition

Why Flume?

Flume architecture principles

The Flume Architecture

Flume event - Stream Data

Flume transaction management

Other flume components

Context Routing

Flume working example

When to use Flume

When not to use Flume

Other options

Summary

Messaging Layer using Apache Kafka

Context in Data Lake - messaging layer

Why Apache Kafka

Kafka architecture

Other Kafka components

Kafka programming interface

Producer and consumer reliability

Kafka security

Kafka as message-oriented middleware

Scale-out architecture with Kafka

Kafka connect

Kafka working example

When to use Kafka

When not to use Kafka

Other options

Summary

Data Processing using Apache Flink

Context in a Data Lake - Data Ingestion Layer

Why Apache Flink?

Working of Flink

Flink API’s

Flink working example

When to use Flink

When not to use Flink

Other options

Summary

Data Store Using Apache Hadoop

Context for Data Lake - Data Storage and lambda Batch layer

Hadoop for near real-time applications

Hadoop deployment modes

Hadoop working examples

When not to use Hadoop

Other Hadoop Processing Options

Summary

Indexed Data Store using Elasticsearch

Context in Data Lake: data storage and lambda speed layer

What is Elasticsearch?

Why Elasticsearch

Working of Elasticsearch

Elastic Stack

Elastic Cloud

Elasticsearch DSL (Query DSL)

Nodes in Elasticsearch

Elasticsearch and relational database

Elasticsearch ecosystem

Elasticsearch deployment options

Clients for Elasticsearch

Elasticsearch for fast streaming layer

Elasticsearch as a data source

Elasticsearch for content indexing

Elasticsearch and Hadoop

Elasticsearch working example

Indexing Documents

Getting Indexed Document

Searching Documents

Updating Documents

Deleting a document

Elasticsearch in purview of SCV use case

Data Lake Components Working Together

Where we stand with Data Lake

Core architecture principles of Data Lake

Challenges faced by enterprise Data Lake

Expectations from Data Lake

Data Lake for other activities

Knowing more about data storage

Knowing more about Data processing

Thoughts on data security

Thoughts on data encryption

Metadata management and governance

Thoughts on Data Auditing

Thoughts on data traceability

Knowing more about Serving Layer

Summary

Data Lake Use Case Suggestions

Establishing cybersecurity practices in an enterprise

Know the customers dealing with your enterprise

Bring efficiency in warehouse management

Developing a brand and marketing of the enterprise

Achieve a higher degree of personalization with customers

Bringing IoT data analysis at your fingertips

More practical and useful data archival

Compliment the existing data warehouse infrastructure

Achieving telecom security and regulatory compliance

Summary

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Preface

Data is becoming very important for many enterprises and it has now become pivotal in many aspects. In fact, companies are transforming themselves with data at the core. This book will start by introducing data, its relevance to enterprises, and how they can make use of this data to transform themselves digitally. To make use of data, enterprises need repositories, and in this modern age, these aren't called data warehouses; instead they are called Data Lake.

As we can see today, we have a good number of use cases that are leveraging big data technologies. The concept of a Data Lake existed there for quite sometime, but recently it has been getting real traction in enterprises. This book brings these two aspects together and gives a hand-on, full-fledged, working Data Lake using the latest big data technologies, following well-established architectural patterns.

The book will bring Data Lake and Lambda architecture together and help the reader to actually operationalize these in their enterprise. It will introduce a number of Big Data technologies at a high level, but we didn't want to make it an authoritative reference on any of these topics, as they are vast in nature and worthy of a book by themselves.

This book instead covers pattern explanation and implementation using chosen technologies. The technologies can of course, be replaced with more relevant ones in future or according to set standards within an organization. So, this book will be relevant not only now but for a long time to come. Compared to a software/technology written targeting a specific version, this does not fall in that category, so the shelf life of this book is quite long compared to other books in the same space.

The book will take you on a fantastic journey, and in doing so, it follows a structure that is quite intuitive and exciting at the same time.

What this book covers

The book is divided into three parts. Each part contains a few chapters, and when a part is completed, readers will have a clear picture of that part of the book in a holistic fashion. The parts are designed and structured in such a way that the reader is first introduced to major functional and technical aspects; then in the following part, or rather the final part, things will all come together. At the end of the book, readers will have an operational Data Lake.

Part 1, Overview, introduces the reader to various concepts relating to data, Data Lake and important components . It consists of four chapters and as detailed below, each chapter well-defined goal to be achieved.

Chapter 1, Introduction to Data, introduces the reader to the book in general and then explains what data is and its relevance to the enterprise. The chapter explains the reasons as to why data in modern world is important and how it can/should be used. Real-life use cases have been showcased to explain the significance of data and how data is transforming businesses today. These real-life use cases will help readers to start their creative juices flowing and get thinking about how they can make a difference to their enterprise using data.

Chapter 2, Comprehensive Concepts of a Data Lake, deepens further into the details of the concept of a Data Lake and explains use of Data Lake in addressing the problems faced by enterprises. This chapter also provides a sneak preview around Lambda architecture and how it can be leveraged for Data Lake. The reader would thus get introduced to the concept of a Data Lake and the various approaches that organizations have adopted to build Data Lake.

Chapter 3, Lambda Architecture as a Pattern for Data Lake, introduces the reader into details of Lambda architecture, its various components and the connection between Data Lake and this architecture pattern. In this chapter the reader will get details around Lambda architecture, with the reasons of its inception and the specific problems that it solves. The chapter also provides the reader with ability to understand the core concepts of Lambda architecture and how to apply it in an enterprise. The reader will understand various patterns and components that can be leveraged to define lambda architecture both in the batch and real-time processing spaces. The reader would have enough background on data, Data Lake and Lambda architecture by now, and can move onto the next section of implementing Data Lake for your enterprise.

Chapter 4, Applied Lambda for Data Lake, introduces reader to technologies which can be used for each layer (component) in Lambda architecture and will also help the reader choose one lead technology in the market which we feel very good at this point in time. In this chapter, the reader will understand various Hadoop distributions in the current landscape of Big Data technologies, and how they can be leveraged for applying Lambda architecture in an enterprise Data Lake. In the context of these technologies, the reader will understand the details of and architectural motivations behind batch, speed and serving layer in an enterprise Data Lake.

Part 2, Technical Building Blocks of Data Lake, introduces reader to many technologies which will be part of the Data Lake implementation. Each chapter covers a technology which will slowly build the Data Lake and the use case namely Single Customer View (SCV). Almost all the important technical details of the technology being discussed in each chapter would be covered in a holistic fashion as in-depth coverage is out of scope of this book. It consists of six chapters and each chapter has a goal well defined to be achieved as detailed below.

Chapter 5, Data Acquisition of Batch Data using Apache Sqoop, delves deep into Apache Sqoop. It gives reasons for this choice and also gives the reader other technology options with good amount of details. The chapter also gives a detailed example connecting Data Lake and Lambda architecture. In this chapter the reader will understand Sqoop framework and similar tools in the space for data loads from an enterprise data source into a Data Lake. The reader will understand the technical details around Sqoop and architecturally the problems that it solves. The reader will also be taken through examples, where the Sqoop will be seen in action and various steps involved in using it with Hadoop technologies.

Chapter 6, Data Acquisition of Stream Data using Apache Flume, delves deep into Apache Flume, thus connecting technologies in purview of Data Lake and Lambda architecture. The reader will understand Flume as a framework and its various patterns by which it can be leveraged for Data Lake. The reader will also understand the Flume architecture and technical details around using it to acquire and consume data using this framework in detail, with specific capabilities around transaction control and data replay with working example. The reader will also understand how to use flume with streaming technologies for stream based processing.

Chapter 7, Messaging Layer using Apache Kafka, delves deep into Apache Kafka. This part of the book initially gives the reader the reason for choosing a particular technology and also gives details of other technology options. . In this chapter, the reader would understand Kafka as a message oriented middleware and how it’s compared with other messaging engines. The reader will get to know details around Kafka and its functioning and how it can be leveraged for building scale-out capabilities, from the perspective of client (publisher), broker and consumer(subscriber). This reader will also understand how to integrate Kafka with Hadoop components for acquiring enterprise data and what capabilities this integration brings to Data Lake.

Chapter 8, Data Processing using Apache Flink, the reader in this chapter would understand the concepts around streaming and stream based processing, and specifically in reference to Apache Flink. The reader will get deep into using Apache Flink in context of Data Lake and in the Big Data technology landscape for near real time processing of data with working examples. The reader will also realize how a streaming functionality would depend on various other layers in architecture and how these layers can influence the streaming capability.

Chapter 9, Data Storage using Apache Hadoop, delves deep into Apache Hadoop. In this chapter, the reader would get deeper into Hadoop Landscape with various Hadoop components and their functioning and specific capabilities that these components can provide for an enterprise Data Lake. Hadoop in context of Data Lake is explained at an implementation level and how Hadoop frameworks capabilities around file storage, file formats and map-reduce capabilities can constitute the foundation for a Data Lake and specific patterns that can be applied to this stack for near real time capabilities.

Chapter 10, Indexed Data Store using Elasticsearch, delves deep into Elasticsearch. The reader will understand Elasticsearch as data indexing framework and various data analyzers provided by the framework for efficient searches. The reader will also understand how elasticsearch can be leveraged for Data Lake and data at scale with efficient sharding and distribution mechanisms for consistent performance. The reader will also understand how elasticsearch can be used for fast streaming and how it can used for high performance applications with working examples.

Part 3, Bringing it all together, will bring together technical components from part one and two of this book to give you a holistic picture of Data Lakes. We will bring in additional concepts and technologies in a brief fashion so that, if needed, you can explore those aspects in more detail according to your enterprise requirements. Again, delving deep into the technologies covered in this chapter is out of the scope of this book. But we want you to be aware of these additional technologies and how they can be brought into our Data Lake implementation if the need arises. It consists of two chapters, and each chapter has a goal well defined to be achieved, as detailed here.

Chapter 11, Data Lake components working together, after introducing reader into Data Lake, Lambda architecture, various technologies, this chapter brings the whole puzzle together and brings in a holistic picture to the reader. The reader at this stage should feel accomplished and can take in the codebase as is into the organization and show it working. In this chapter, the reader, would realize how to integrate various aspects of Data Lake to implement a fully functional Data Lake. The reader will also realize the completeness of Data Lake with working examples that would combine all the learning from previous chapters into a running implementation.

Chapter 12, Data Lake Use Case Suggestions, throughout the book the reader is taken through a use case in the form of “Single Customer View”; however while going through the book, there are other use cases in pipeline relevant to their organization which reader can start thinking. This provoking of thought deepens into bit more during this chapter. The reader will understand and realize various use cases that can reap great benefits from a Data Lake and help optimize their cost of ownership, operations, reactiveness and help these uses with required intelligence derived from data. The reader, in this chapter, will also realize the variety of these use cases and the extents to which an enterprise Data Lake can be helpful for each of these use cases.

What you need for this book

This book is for developers, architects, and product/project owners, for realizing Lambda-architecture-based Data Lakes for Enterprises. This book comprises working examples to help the reader understand and observe various concepts around Data Lake and its basic implementation. In order to run the examples, one will need access to various pieces of open source software, required infrastructure, and development IDE. Specific efforts have been made to keep the examples simple and leverage commonly available frameworks and components. The operating system used for running these examples is CentOS 7, but these examples can run on any flavour of the Linux operating system.

Who this book is for

Java developers and architects who would like to implement Data Lake for their enterprise
Java developers who aim to get hands-on experience on Lambda Architecture and Big Data technologies
Java developers who would like to discover the world of Big Data and have an urge to implement a practical solution using those technologies.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Rename the completed spool file to spool-1 as specified in the earlier example."

A block of code is set as follows:

agent.sources = spool-source
agent.sources.spool-source.type=spooldir
agent.sources.spool-source.spoolDir=/home/centos/flume-data
agent.sources.spool-source.interceptors=ts uuid

Any command-line input or output is written as follows:

${FLUME_HOME}/bin/flume-ng agent --conf ${FLUME_HOME}/conf/  -f ${FLUME_HOME}/conf/spool-fileChannel-kafka-flume-conf.properties  -n agent -Dflume.root.logger=INFO,console

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: " Without minimum or no delay (NRT: Near Real Time or Real time) the company wanted the data produced to be moved to Hadoop system"

Note

Warnings or important notes appear in a box like this.

Note

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Data-Lake-for-Enterprises. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/supportand enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Data Lake for Enterprises

By : Vivek Mishra, Tomcy John, Pankaj Misra

Data Lake for Enterprises

By: Vivek Mishra, Tomcy John, Pankaj Misra

Overview of this book

Related Content you might be interested in

Current Title:

Data Lake for Enterprises

Modern Big Data Processing with Hadoop

Mastering Hadoop 3

Apache Hadoop 3 Quick Start Guide

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Note

Note

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions