Data Lake Development with Big Data

Overview of this book

A Data Lake is a highly scalable platform for storing huge volumes of multi-structured data from disparate sources with centralized data management services. This book explores the potential of Data Lakes and the architectural approaches to building them so that they can ingest, index, manage, and analyze massive amounts of data using batch and real-time processing frameworks. It guides you through building a Data Lake that is managed by Hadoop and accessed as required by other Big Data applications. Using best practices, this book will guide you in developing a Data Lake's capabilities, focusing on architecting data governance, security, data quality, data lineage tracking, metadata management, and semantic data tagging. By the end of this book, you will have a good understanding of how to build a Data Lake for Big Data.

Need for Data Lake


Now that we have glimpsed the past and understood how various systems evolved over time, let us explore, in this section, a few important reasons why Data Lakes have evolved and the problems they try to address. Let's start with a contextual overview.

One of the key driving forces behind the onslaught of Big Data is the rapid spread of unstructured data, which constitutes an estimated 90 percent of all data. The increase in mobile phones, wider internet coverage, faster data networks, cheaper cloud storage, and falling compute and storage prices all contribute to the spurt of Big Data in recent years. A few studies suggest that we now produce as much data every 15 minutes as was created from the beginning of time up to the year 2003. This growth roughly coincides with the proliferation of mobile and cloud usage.

Big Data is not only about capturing and storing massive amounts of data at a cheaper price point; the real value comes from intelligently combining the historical data that already exists inside an organization with new unstructured data. This helps in gaining new and better insights that improve business performance.

For example, in retail, it could mean better and faster service to customers; in R&D, it could mean performing iterative tests over much larger samples more quickly; in healthcare, it could mean quicker and more precise diagnoses of illnesses.

For an organization to be truly successful in reaping the latent benefits of Big Data, it needs two basic capabilities:

  • Technology should be in place to enable organizations to acquire, store, combine, and enrich huge volumes of structured and unstructured data in raw format (see the ingestion sketch after this list)

  • The ability to perform analytics, including real-time and near-real-time analysis, at scale on these huge volumes in an iterative way
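
To make the first capability concrete, here is a minimal PySpark sketch of landing raw, multi-structured data in a Data Lake's raw zone; it is illustrative only (not code from this book), and the HDFS paths, the clickstream example, and the partitioning column are assumptions made for the illustration:

    # Illustrative sketch: land raw clickstream events without an upfront model.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("raw-ingest-sketch").getOrCreate()

    # Read semi-structured JSON exactly as it arrives; no table definition is
    # required before the data can be stored.
    raw_events = spark.read.json("hdfs:///landing/clickstream/2023-09-01/*.json")

    # Keep the records in raw form, adding only an ingest date for partitioning,
    # so future (still unknown) analyses can reinterpret the original data.
    (raw_events
        .withColumn("ingest_date", F.current_date())
        .write
        .mode("append")
        .partitionBy("ingest_date")
        .parquet("hdfs:///datalake/raw/clickstream/"))

The point is that nothing about the eventual analysis has to be decided at this stage; the data is simply acquired and preserved in its raw format.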

To address the preceding two business needs, the concept of the Data Lake has emerged as an empowering data capture and processing capability for Big Data analytics.

The Data Lake makes it possible to store all the data, ask complex and radically bigger business questions, and uncover hidden patterns and relationships in the data.

With a traditional system, an enterprise has no way of finding out whether there is any hidden value in the data that it is not storing right now or is letting go as waste. We don't really know what hidden value this data contains at the time of acquisition; we might know a minuscule percentage of the questions to ask at that point, but we can never know what questions could materialize later. Essentially, a Data Lake tries to address this core business problem.

While reasons abound to explain the need for a Data Lake, one of the core reasons is the dramatic decrease in storage costs, which enables organizations to store humongous amounts of data.

Let us look at a few reasons for the emergence of the Data Lake, with reference to the areas where traditional approaches fall short:

  • Traditional data warehouse (DW) systems are not designed to integrate, scale with, and handle this exponential growth of multi-structured data. With the emergence of Big Data, there is a need to bring together data from disparate sources and derive meaning from it; new types of data ranging from social text, audio, and video to sensor and clickstream data have to be integrated to uncover complex relationships in the data.

  • Traditional systems lack the ability to integrate data from disparate sources. This leads to a proliferation of data silos, with business users viewing the data from differing perspectives, which eventually prevents them from making precise and appropriate decisions.

  • The schema-on-write approach followed by traditional systems mandates that the data model and analytical framework be designed before any data is loaded. Upfront data modeling fails in a Big Data scenario because we are unaware of the nature of the incoming data and of the exploratory analysis that has to be performed to gain hidden insights. Analytical frameworks are designed to answer only the specific questions identified at design time, so this approach does not allow for data discovery. A Data Lake instead follows schema-on-read, applying structure only at query time (see the sketch after this list).

  • With traditional approaches, optimization for analytics is time-consuming and incurs huge costs. Such optimization enables known analytics, but fails when new requirements arise.

  • In traditional systems, it is difficult to identify what data is available and to integrate that data to answer a question. Metadata management and data lineage tracking are either unavailable or difficult to implement; manually recreating data lineage is error-prone and time-consuming.
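
To illustrate the schema-on-read alternative to the schema-on-write limitation described above, here is a minimal PySpark sketch; it is illustrative only (not code from this book), and the file path and column names are assumptions:

    # Illustrative sketch: apply a schema at query time (schema-on-read).
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

    # Only the structure that this particular question needs is declared, and only
    # now; a schema-on-write system would require it before any data was loaded.
    query_schema = StructType([
        StructField("user_id", StringType()),
        StructField("page", StringType()),
        StructField("event_time", StringType()),
    ])

    events = (spark.read
              .schema(query_schema)
              .json("hdfs:///landing/clickstream/*/*.json"))

    # An exploratory question that was not anticipated at ingestion time.
    events.groupBy("page").count().orderBy("count", ascending=False).show(10)

The same raw files can later serve entirely different questions simply by applying a different schema or projection at read time, which is what enables the data discovery that schema-on-write precludes.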