Data Lake Development with Big Data
Overview of this book

A Data Lake is a highly scalable platform for storing huge volumes of multi-structured data from disparate sources, with centralized data management services. This book explores the potential of Data Lakes and the architectural approaches to building them, so that they can ingest, index, manage, and analyze massive amounts of data using batch and real-time processing frameworks. It guides you through building a Data Lake that is managed by Hadoop and accessed as required by other Big Data applications, and it shows you, using best practices, how to develop the Data Lake's capabilities. It focuses on architecting data governance, security, data quality, data lineage tracking, metadata management, and semantic data tagging. By the end of this book, you will have a good understanding of how to build a Data Lake for Big Data.

Defining Data Lake


In the preceding sections, we took a quick look at how the traditional systems evolved over time and examined their shortcomings with respect to the newer forms of data. In this section, let us discover what a Data Lake is and how it addresses these gaps, which are really opportunities in disguise.

There is no single fixed definition of a Data Lake. At its core, it is a data storage and processing repository into which an organization can place all of its data, so that data from every internal and external system, partner, and collaborator flows in and insights spring out.

The following list sums up, in a nutshell, what a Data Lake is:

  • A Data Lake is a huge repository that holds every kind of data in its raw format until it is needed by anyone in the organization for analysis.

  • A Data Lake is not Hadoop. It uses many different tools; Hadoop implements only a subset of the required functionality.

  • A Data Lake is not a database in the traditional sense of the word. A typical Data Lake implementation uses various NoSQL and in-memory databases that can coexist with their relational counterparts.

  • A Data Lake cannot be implemented in isolation. It has to be built alongside a data warehouse, as it complements various functionalities of the DW.

  • It stores large volumes of both unstructured and structured data. It also stores fast-moving streamed data from machine sensors and logs.

  • It advocates a Store-All approach to huge volumes of data.

  • It is optimized for data crunching in a high-latency batch mode, and it is not geared for transaction processing.

  • It helps in creating data models that are flexible and can be revised without a database redesign.

  • It can quickly perform data enrichment, which helps in achieving enhancement, augmentation, classification, and standardization of the data.

  • All of the data stored in the Data Lake can be utilized to get an all-inclusive view. This enables near-real-time, more precise predictive models that go beyond sampling; it also aids in generating multi-dimensional models.

  • It is a data scientist's favorite hunting ground. Data scientists get to access the data stored in its raw glory at its most granular level, so that they can perform any ad-hoc queries and build advanced models at any time, iteratively. The classic data warehouse approach does not support this ability to condense the time between data intake and insight generation.

  • It enables us to model the data not only in the traditional relational way; the real value of the data can emanate from modeling it in the following ways (see the sketch after this list):

    • As a graph to find the interactions between elements; for example, Neo4J

    • As a document store to cluster similar text; for example, MongoDB

    • As a columnar store for fast updates and search; for example, HBase

    • As a key-value store for lightning-fast search; for example, Riak
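
To make the preceding list concrete, here is a minimal Python sketch that projects one raw event from the lake into each of the four models. Plain dictionaries and tuples stand in for the actual stores (Neo4J, MongoDB, HBase, and Riak each have their own client libraries and on-disk formats); the point is only the shape each model gives to the same data, and the field names are made up for illustration.

```python
# One raw event as it might land in the lake, untouched.
raw_event = {
    "user": "alice", "action": "purchase",
    "item": "book-42", "ts": "2015-06-01T10:22:31Z",
}

# Graph model: nodes and edges capture interactions between elements.
graph_edge = ("alice", "PURCHASED", "book-42")

# Document model: the whole event is kept as one self-describing document.
document = dict(raw_event)

# Columnar model: values grouped under column families, keyed by row.
columnar_row = {
    "row_key": "alice#2015-06-01T10:22:31Z",
    "activity:action": "purchase",
    "activity:item": "book-42",
}

# Key-value model: a single opaque value behind a lookup key.
key_value = {"user:alice:last_purchase": "book-42"}

print(graph_edge, document, columnar_row, key_value, sep="\n")
```

Because the raw event is preserved, each of these projections can be built, discarded, and rebuilt at will; none of them is the canonical copy.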

A key attribute of a Data Lake is that data is not classified when it is stored. As a result, the data preparation, cleansing, and transformation tasks are eliminated; these tasks generally take the lion's share of time in a Data Warehouse. Storing data in its rawest form enables us to find answers from the data for questions we do not yet know; a traditional data warehouse, by contrast, is optimized for answering questions that we already know, so preparation of the data is a mandatory step there.
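
The following sketch illustrates this schema-on-read idea, assuming the lake holds raw, newline-delimited JSON. The structure is imposed only at query time, so a new question never requires re-ingesting or re-modeling the stored data; the file contents and field names here are hypothetical.

```python
import io
import json

# Stand-in for a raw file in the lake; note the older record is missing
# a field that was added later. Nothing was cleansed at load time.
raw_lake_file = io.StringIO(
    '{"user": "alice", "amount": "19.99", "country": "DE"}\n'
    '{"user": "bob", "amount": "5.00"}\n'
)

def read_with_schema(fileobj):
    """Apply today's schema while reading; tolerate older raw records."""
    for line in fileobj:
        record = json.loads(line)
        yield {
            "user": record["user"],
            "amount": float(record["amount"]),       # cast at read time
            "country": record.get("country", "??"),  # default for old data
        }

total = sum(r["amount"] for r in read_with_schema(raw_lake_file))
print(f"total: {total:.2f}")
```

If the schema changes tomorrow, only the reader changes; the stored bytes stay exactly as they arrived.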

This reliance on raw data makes it easy for the business to consume just what it wants from the lake and refine it for its purpose. Crucially, the raw data in the Data Lake makes multiple perspectives on the same source possible, so that everyone can get their own viewpoint on the data in a manner that enables their local business to succeed.

This flexibility of storing all the data in a single Big Data repository and creating multiple viewpoints requires the Data Lake to implement controls for corporate data consistency. To achieve this, targeted information governance policies are enforced, and corporate collaboration and access controls are implemented using Master Data Management (MDM), Reference Data Management (RDM), and other security controls.
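
The sketch below illustrates the flavor of such governance controls in Python. The catalog entries, classification labels, and roles are hypothetical placeholders, not the API of any particular MDM or RDM product; in a real Hadoop deployment, policies like these would typically be enforced by a dedicated tool such as Apache Ranger.

```python
# Hypothetical governance catalog: each lake dataset carries metadata
# tags, including a sensitivity classification assigned at ingestion.
CATALOG = {
    "raw/clickstream": {"owner": "marketing", "classification": "internal"},
    "raw/payments":    {"owner": "finance",   "classification": "restricted"},
}

# Role-based policy: which classifications each role may read.
POLICY = {
    "analyst":      {"internal"},
    "risk_officer": {"internal", "restricted"},
}

def can_read(role: str, dataset: str) -> bool:
    """Check a role's access against the dataset's classification tag."""
    classification = CATALOG[dataset]["classification"]
    return classification in POLICY.get(role, set())

print(can_read("analyst", "raw/payments"))       # False
print(can_read("risk_officer", "raw/payments"))  # True
```

Keeping the policy in a catalog, rather than baked into each dataset's copy, is what lets many viewpoints coexist over one store while corporate consistency is still enforced centrally.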