Data Lake Development with Big Data

By: Pradeep Pasupuleti, Beulah Salome Purra

Overview of this book

A Data Lake is a highly scalable platform for storing huge volumes of multistructured data from disparate sources with centralized data management services. This book explores the potential of Data Lakes and the architectural approaches to building Data Lakes that ingest, index, manage, and analyze massive amounts of data using batch and real-time processing frameworks. It guides you through building a Data Lake that is managed by Hadoop and accessed as required by other Big Data applications. Using best practices, it shows readers how to develop a Data Lake's capabilities, with a focus on architecting data governance, security, data quality, data lineage tracking, metadata management, and semantic data tagging. By the end of this book, you will have a good understanding of building a Data Lake for Big Data.
Table of Contents (13 chapters)

When to go for a Data Lake implementation

The preceding section highlighted the top benefits of a Data Lake and showed how applying them takes on strategic importance in an organization.

In this section, we enumerate a few quick-reference scenarios in which a Data Lake can be recommended as a go-to solution:

  • Your organization is planning to extract insights from huge volumes of data, or from high-velocity data, that a traditional data warehouse is incapable of handling.

  • The business landscape is forcing your organization to adapt to market challenges, such as handling demand for new products at a moment's notice, and you have to get insights quickly.

  • Your organization needs to build data products that use new data that is not yet prepared and structured. As new data becomes available, you may need to incorporate it straightaway. It probably can't wait for a schema change, the building of an extension, and the long delays these involve, because the business needs the insight right now.

  • Your organization needs a dynamic approach to extracting insights from data, where business units can tap into or refine the required information when they need it.

  • Your organization is looking for ways to reduce the total cost of ownership of a data warehouse implementation by leveraging a Data Lake that significantly lowers storage, operational, network, and computing costs and produces better insights.

  • Your organization needs to improve its top line and wants to augment internal data (such as customer data) with external data (social media and nontraditional data) from a variety of sources. This can yield a broader view and a better behavioral profile of the customer, resulting in quicker customer acquisition.

  • Your organization's data science/advanced analytics teams want to preserve the integrity/fidelity of the original data and need lineage tracking of data transformations to capture the origin of a specific datum and to track the lifecycle of the data as it moves through the data pipeline.

  • There is a pressing need for the structuring and standardization of Big Data for new and broader data enrichment.

  • There is a need for near real-time analytics for faster/better decisions and point-of-service use.

  • Your organization needs an integrated data repository for plug-and-play implementation of new analytics tools and data products.

  • Your data science/advanced analytics teams regularly need quick provisioning of data without having to wait in an endless queue; the Data Lake capability called Data as a Service (DaaS) could be a solution.
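
To make the lineage-tracking scenario in the list above more concrete, here is a minimal Python sketch of the idea: the original datum is kept untouched, and every transformation is logged with fingerprints of its input and output so that the path of the data through a pipeline can be reconstructed. This is only an illustration under stated assumptions, not how any particular Data Lake product implements lineage. The names `ingest`, `TrackedRecord`, `LineageStep`, and `fingerprint`, as well as the sample record and the source label `crm_export`, are hypothetical.

```python
import copy
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, List


def fingerprint(record: Any) -> str:
    """Return a stable SHA-256 digest of a JSON-serializable record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class LineageStep:
    """One logged transformation (hypothetical structure)."""
    name: str
    input_hash: str
    output_hash: str


@dataclass
class TrackedRecord:
    raw: Any       # original datum, preserved as ingested
    current: Any   # latest transformed value
    source: str    # where the datum came from
    steps: List[LineageStep] = field(default_factory=list)

    def apply(self, name: str, fn: Callable[[Any], Any]) -> "TrackedRecord":
        """Run fn on a copy of the current value and log the step."""
        before = fingerprint(self.current)
        result = fn(copy.deepcopy(self.current))
        self.steps.append(LineageStep(name, before, fingerprint(result)))
        self.current = result
        return self


def ingest(raw: Any, source: str) -> TrackedRecord:
    """Wrap a raw datum; transformations work on a copy, never on raw."""
    return TrackedRecord(raw=raw, current=copy.deepcopy(raw), source=source)


# Illustrative usage with made-up data.
record = ingest({"customer": " Alice ", "spend": "120.50"}, source="crm_export")
record.apply("trim_name", lambda r: {**r, "customer": r["customer"].strip()})
record.apply("parse_spend", lambda r: {**r, "spend": float(r["spend"])})

for step in record.steps:
    print(step.name, step.input_hash[:8], "->", step.output_hash[:8])
```

Because each step's output fingerprint matches the next step's input fingerprint, the chain can be walked back from any derived value to the preserved original. A production system would typically store this metadata in a dedicated catalog rather than alongside the record.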