Book Image

Data Lake for Enterprises

By : Vivek Mishra, Tomcy John, Pankaj Misra
Book Image

Data Lake for Enterprises

By: Vivek Mishra, Tomcy John, Pankaj Misra

Overview of this book

The term "Data Lake" has recently emerged as a prominent term in the big data industry. Data scientists can make use of it in deriving meaningful insights that can be used by businesses to redefine or transform the way they operate. Lambda architecture is also emerging as one of the very eminent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable business to take critical decisions. This book tries to bring these two important aspects — data lake and lambda architecture—together. This book is divided into three main sections. The first introduces you to the concept of data lakes, the importance of data lakes in enterprises, and getting you up-to-speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and ElasticSearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use-cases. It also shows you how other peripheral components can be added to the lake to make it more efficient. By the end of this book, you will be able to choose the right big data technologies using the lambda architectural patterns to build your enterprise data lake.
Table of Contents (23 chapters)
Title Page
About the Authors
About the Reviewers
Customer Feedback
Part 1 - Overview
Part 2 - Technical Building blocks of Data Lake
Part 3 - Bringing It All Together

Enterprise’s current state

As explained briefly in the previous sections, the current state of enterprise data in an organization can be summarized in bullets points as follows:

  • Conventional DW (Data Warehouse) /BI (Business Intelligence):
    • Refined/ cleansed data transferred from production business application using ETL.
    • Data earlier than a certain period would have already been transferred to a storage, which is hard to retrieve, such as magnetic tape storage.
    • Some of its notable deficiencies are as follows:
      • A subset of production data in a cleansed format exists in DW; for any new element in DW, effort has to be made
      • A subset of the data is again in DW, and the rest gets transferred to permanent storage
      • Usually, analysis is really slow, and it is optimized again to perform queries, which are, to an extent, defined
  • Siloed Big Data:
    • Some departments would have taken the right step in building big data. But departments generally don’t collaborate with each other, and this big data becomes siloed and doesn't give the value of a true big data for the enterprise.
    • Some of its deficiencies are as follows:
      • Because of its siloed nature, the analyst is again constrained and not able to mix and match data between departments.
      • A good amount of money would have been spent to build and maintain/manage this and usually over a period of time is not sustainable.
  • Myriad of non-connected applications:
    • There is a good amount of applications on premises and on cloud.
    • Applications apart from churning structured data also produce unstructured data.
    • Some of the deficiencies are as follows:
      • Don't talk to each other
      • Even if it talks, data scientists are not able to use it in an effective way to transform the enterprise in a meaningful way
      • Replication of technology usage for handling many aspects in each business application

We wouldn't say that creating or investing in Data lake is a silver bullet to solve all the aforementioned deficiencies. But it is definitely a step in the right direction, and every enterprise should at least spend some time discussing whether this is indeed required, and if it is a yes, don't deliberate over it too much and take the next step in the path of implementation.

Data lake is an enterprise initiative, and when built, it has to be with the consent of all the stakeholders, and it should have buy-ins from the top executives. It can definitely find ways to improve processes by which enterprises do business. It can help the higher management know more about their business and can increase the success rate of the decision-making process.