Book Image

Data Lake for Enterprises

By : Vivek Mishra, Tomcy John, Pankaj Misra
Book Image

Data Lake for Enterprises

By: Vivek Mishra, Tomcy John, Pankaj Misra

Overview of this book

The term "Data Lake" has recently emerged as a prominent term in the big data industry. Data scientists can make use of it in deriving meaningful insights that can be used by businesses to redefine or transform the way they operate. Lambda architecture is also emerging as one of the very eminent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable business to take critical decisions. This book tries to bring these two important aspects — data lake and lambda architecture—together. This book is divided into three main sections. The first introduces you to the concept of data lakes, the importance of data lakes in enterprises, and getting you up-to-speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and ElasticSearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use-cases. It also shows you how other peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies using the lambda architectural patterns to build your enterprise data lake.
Table of Contents (23 chapters)
Title Page
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Preface
Part 1 - Overview
Part 2 - Technical Building blocks of Data Lake
Part 3 - Bringing It All Together

Enterprise Data Management


Ability of an organization to precisely define, easily integrate and effectively retrieve data for both internal applications and external communication

- Wikipedia

EDM emphasizes data precision, granularity and meaning and is concerned with how the content is integrated into business applications as well as how it is passed along from one business process to another.

- Wikipedia

As the preceding wikipedia definition clearly states, EDM is the process or strategy of determining how this enterprise data needs to be stored, where it has to be stored, and what technologies it has to use to store and retrieve this data in an enterprise. Being very valuable, this data has to be secured using the right controls and needs to be managed and owned in a defined fashion. It also defines how the data can be taken out to communicate with both internal and external applications alike. Furthermore, the policies and processes around the data exchange have to be well defined.

Looking at the previous paragraph, it seems that it is very easy to have EDM in place for an enterprise, but in reality, it is very difficult. In an enterprise, there are multiple departments, and each department churns out data; based on the significance of these departments, the data churned would also be very relevant to the organization as a whole. Because of the distinction and data relevance, the owner of each data in EDM has different interests, causing conflicts and thus creating problems in the enterprise. This calls for various policies and procedures along with ownership of each data in EDM.

In the context of this book, learning about enterprise data, enterprise data management, and issues around maintaining an EDM are quite significant. This is the reason why it's good to know these aspects at the start of the book itself. In the following sections we will discuss big data concepts and ways in which big data can be incorporated into enterprise data management and extend its capabilities with opportunities that could not be imagined without big data technologies.