Data Lake Development with Big Data

Overview of this book

A Data Lake is a highly scalable platform for storing huge volumes of multistructured data from disparate sources with centralized data management services. This book explores the potential of Data Lakes and the architectural approaches to building Data Lakes that ingest, index, manage, and analyze massive amounts of data using batch and real-time processing frameworks. It guides you through building a Data Lake that is managed by Hadoop and accessed as required by other Big Data applications. Using best practices, it shows how to develop a Data Lake's capabilities, focusing on architecting data governance, security, data quality, data lineage tracking, metadata management, and semantic data tagging. By the end of this book, you will have a good understanding of building a Data Lake for Big Data.

Data Lake architecture


The previous sections made an effort to introduce you to the high-level concepts of the whys and whats of a Data Lake. We have now come to the last section of this chapter where you will be exposed to the internals of a Data Lake. We will take a deep dive into the architecture of Data Lake and understand the key components.

Architectural considerations

In our experience, it is difficult in practice to come up with a one-size-fits-all architecture for a Data Lake. In every assignment that we have worked on, we had to deal with specific, tailored requirements that made us adapt the architecture to the use case.

There are multiple interpretations of the Data Lake architecture because it depends on factors that are specific to an organization and on the business questions that the Data Lake ought to answer. To realize any given combination of these factors in the Data Lake, we had to tweak the architecture. Here is a quick list of these factors:

  • Type of data ingest (real-time ingest, micro-batch ingest, and batch ingest)

  • Storage tier (raw versus structured)

  • Depth of metadata collection

  • Breadth of data governance

  • Ability to search for data

  • Structured data storage (SQL versus a variety of NoSQL databases such as graph, document, and key-value stores)

  • Provisioning of data access (external versus internal)

  • Speed to insights (optimized for real-time versus batch)

As you can see from the preceding points, many competing and sometimes contradictory requirements go into building a Data Lake capability. Architecting a full-fledged, production-ready Data Lake means weighing these combinations of requirements and settling on the best trade-off.

For the purpose of this book, we prefer to take a median approach to architecting a Data Lake. We believe that this approach will appeal to most readers who want to grasp the overarching potential of a full-blown Data Lake as it should be in its end state, rather than being tied down to narrow interpretations of specific business requirements and overfitting the architecture to them.

Architectural composition

For ease of understanding, we can abstract away much of the detail and think of the Data Lake as composed of three layers and three tiers.

Layers are common functionality that cuts across all the tiers. These layers are listed as follows:

  • Data Governance and Security Layer

  • Metadata Layer

  • Information Lifecycle Management Layer

Tiers are abstractions for a similar functionality grouped together for the ease of understanding. Data flows sequentially through each tier. While the data moves from tier to tier, the layers do their bit of processing on the moving data. The following are the three tiers:

  • Intake Tier

  • Management Tier

  • Consumption Tier
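
The relationship between tiers and layers described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the `Tier`, `Layer`, and `run_pipeline` names are hypothetical, and the per-tier processing is reduced to setting a flag. It shows data moving sequentially through the tiers while every layer gets a chance to act on it at each step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class Tier:
    """A stage that data passes through in sequence (hypothetical type)."""
    name: str
    process: Callable[[Record], Record]


@dataclass
class Layer:
    """A cross-cutting concern applied in every tier (hypothetical type)."""
    name: str
    apply: Callable[[Record, str], Record]


def run_pipeline(record: Record, tiers: List[Tier], layers: List[Layer]) -> Record:
    """Move a record through each tier; after each tier, let every layer act on it."""
    for tier in tiers:
        record = tier.process(dict(record))
        for layer in layers:
            record = layer.apply(record, tier.name)
    return record


def log_visit(record: Record, tier_name: str) -> Record:
    """Stand-in for the Metadata Layer: remember which tiers the record visited."""
    updated = dict(record)
    updated["visited"] = list(updated.get("visited", [])) + [tier_name]
    return updated


TIERS = [
    Tier("intake", lambda r: {**r, "raw": True}),
    Tier("management", lambda r: {**r, "cleansed": True}),
    Tier("consumption", lambda r: {**r, "published": True}),
]
LAYERS = [Layer("metadata", log_visit)]
```

Calling `run_pipeline({"id": 1}, TIERS, LAYERS)` returns a record that carries the flags set by each tier along with the list of tiers recorded by the layer.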

The following figure simplifies the representation of each tier in relation to the layers:

Data Lake end state architecture

Architectural details

In this section, let us go deeper and understand each of the layers and tiers of the Data Lake.

Understanding Data Lake layers

In this section, we will gain a high-level understanding of the relevance of the three horizontal layers.

The Data Governance and Security Layer

The Data Governance and Security layer assigns responsibility for governing access to the right data and for granting the rights to define and modify data. This layer makes sure that there is a well-documented process for the change and access control of all the data artifacts. The governance mechanism oversees the methods for creating, using, and tracking data lineage across the various tiers of the Data Lake so that lineage can be combined with the security rules.

As the Data Lake stores a lot of data from various sources, the Security layer ensures that appropriate access control and authentication provide access to data assets on a need-to-know basis. In a practical scenario, the data might consist of both transactional and historical data, along with customer, product, and finance data, sourced internally as well as from third parties; the Security layer ensures that each subject area of the data has the applicable level of security.

This layer ensures appropriate provisioning of data with relevant security measures put in place. Hadoop's security is taken care of by its built-in integration with Kerberos, which makes it possible to ensure that users are authenticated before they access data or compute resources.
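
To make the need-to-know idea concrete, here is a minimal sketch of a per-subject-area access check. It is not how Hadoop or Kerberos implement authorization; the `User` type, the role names, and the `READ_POLICY` table are all assumptions made up for this illustration, and authentication is reduced to a boolean flag that a real system would obtain from its authentication service.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class User:
    name: str
    authenticated: bool          # assumed to be set by an upstream authentication step
    roles: FrozenSet[str]


# Hypothetical policy: the roles allowed to read each subject area.
READ_POLICY: Dict[str, FrozenSet[str]] = {
    "customer": frozenset({"marketing_analyst", "data_steward"}),
    "finance": frozenset({"finance_analyst", "data_steward"}),
    "product": frozenset({"marketing_analyst", "finance_analyst", "data_steward"}),
}


def can_read(user: User, subject_area: str) -> bool:
    """Grant access only to authenticated users holding a role listed for the area."""
    if not user.authenticated:
        return False
    allowed_roles = READ_POLICY.get(subject_area, frozenset())  # unknown area: deny
    return bool(user.roles & allowed_roles)
```

The default-deny behavior for unknown subject areas is a design choice of this sketch: data that has not yet been classified is treated as inaccessible rather than open.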

The following figure shows the capabilities of the Data Governance and Security layer:

The Data Governance and Security layer

The Information Lifecycle Management layer

As the Data Lake advocates a Store-All approach to huge volumes of Big Data, it is tempting to store everything in it. The Information Lifecycle Management (ILM) layer ensures that there are rules governing what we can or cannot store in the Data Lake. This is because, over longer periods of time, the value of data tends to decrease and the risks associated with storing it increase. It does not make practical sense to fill the lake continuously without some plan to down tier the data that has passed its use-by date; this is exactly what the ILM layer strives to achieve.

This layer primarily defines the strategy and policies for classifying which data is valuable and how long we should store a particular dataset in the Data Lake. These policies are implemented by tools that automatically purge, archive, or down tier data based on the classified policy.
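
A retention policy of this kind can be sketched as a small decision function. The classifications, the age thresholds, and the `decide` function below are hypothetical values chosen for illustration; real policies would come from the organization's governance rules.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    DOWN_TIER = "down_tier"
    ARCHIVE = "archive"
    PURGE = "purge"


@dataclass(frozen=True)
class RetentionPolicy:
    down_tier_after: timedelta
    archive_after: timedelta
    purge_after: timedelta


# Hypothetical thresholds per data classification.
POLICIES = {
    "high_value": RetentionPolicy(
        timedelta(days=365), timedelta(days=3 * 365), timedelta(days=7 * 365)
    ),
    "low_value": RetentionPolicy(
        timedelta(days=30), timedelta(days=90), timedelta(days=365)
    ),
}


def decide(classification: str, ingested_on: date, today: date) -> Action:
    """Pick the lifecycle action for a dataset based on its class and age."""
    policy = POLICIES[classification]
    age = today - ingested_on
    if age >= policy.purge_after:
        return Action.PURGE
    if age >= policy.archive_after:
        return Action.ARCHIVE
    if age >= policy.down_tier_after:
        return Action.DOWN_TIER
    return Action.KEEP
```

A scheduled job could call `decide` for every dataset registered in the metadata and hand the result to whichever tool actually moves or deletes the files.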

The following figure depicts the high-level functionalities of the Information Lifecycle Management layer:

Information Lifecycle Management layer

The Metadata Layer

The Data Lake stores large quantities of structured and unstructured data and there should be a mechanism to find out the linkages between what is stored and what can be used by whom. The Metadata Layer is the heart of the Data Lake. The following list elucidates the essence of this layer:

  • The Metadata layer captures vital information about the data as it enters the Data Lake and indexes this information so that users can search metadata before they access the data itself. Metadata capture is fundamental to make data more accessible and to extract value from the Data Lake.

  • This layer provides vital information to the users of the Data Lake about the background and significance of the data stored in the Data Lake. For instance, data consumers could use the metadata to find out whether a million tweets are more valuable than a thousand customer records. This is accomplished by intelligently tagging every bit of data as it is ingested.

  • A well-built metadata layer will allow organizations to harness the potential of the Data Lake and deliver the following mechanisms to the end users to access data and perform analytics:

    • Self-Service BI (SSBI)

    • Data as a Service (DaaS)

    • Machine Learning as a Service (MLaaS)

    • Data Provisioning (DP)

    • Analytics Sandbox Provisioning (ASP)

  • The Metadata layer defines the structure for files in a Raw Zone and describes the entities inside the files. Using this base-level description, the schema evolution of the file/record is tracked by a versioning scheme. This will eventually allow you to create associations among several entities and, thereby, facilitate browsing and searching.
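
The ideas in the preceding list (tagging on ingest, searching metadata before touching the data, and versioning the schema as it evolves) can be sketched with a small in-memory catalog. The `MetadataCatalog` class and its methods are hypothetical names for this illustration; a production Data Lake would back this with a dedicated metadata store and search index.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class DatasetEntry:
    name: str
    source: str
    tags: Set[str] = field(default_factory=set)
    schema_versions: List[Tuple[str, ...]] = field(default_factory=list)


class MetadataCatalog:
    """Toy catalog: tags datasets on ingest and tracks schema versions."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}
        self._tag_index: Dict[str, Set[str]] = {}

    def register(self, name: str, source: str,
                 tags: Iterable[str], columns: Iterable[str]) -> int:
        """Record an ingest and return the current schema version number."""
        entry = self._entries.setdefault(name, DatasetEntry(name=name, source=source))
        for tag in tags:
            entry.tags.add(tag)
            self._tag_index.setdefault(tag, set()).add(name)
        schema = tuple(columns)
        # A new version is stored only when the column list actually changes.
        if not entry.schema_versions or entry.schema_versions[-1] != schema:
            entry.schema_versions.append(schema)
        return len(entry.schema_versions)

    def search(self, tag: str) -> List[str]:
        """Find dataset names by tag without reading the data itself."""
        return sorted(self._tag_index.get(tag, set()))

    def schema_history(self, name: str) -> List[Tuple[str, ...]]:
        return list(self._entries[name].schema_versions)
```

Registering the same dataset again with an extra column produces version 2, while re-registering with an unchanged column list leaves the version number as it is.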

The following figure illustrates the various capabilities of the Metadata Layer:

The Metadata Layer

Understanding Data Lake tiers

In the following section, let us take a short tour of each of the three tiers in the Data Lake.

The Data Intake tier

The Data Intake tier includes all the processing services that connect to external sources, along with the storage area for acquiring varied source data in consumable increments.

The Intake tier has three zones, and the data flows sequentially through these zones. The zones in the Intake tier are as follows:

  • Source System Zone

  • Transient Landing Zone

  • Raw Zone

Let us examine each zone of this tier in detail:

The Source System Zone

The processing services that are needed to connect to external systems are encapsulated in the Source System Zone. This zone primarily deals with the connectivity and acquires data from the external source systems.

In the Source System Zone, the timeliness of data acquisition from the external sources is determined by specific application requirements. Certain classes of applications need to pull log or sensor data in near real time and flag anomalies in real time. Other classes of applications can live with batch data trickling in at intervals as long as a day; this class uses all the historical data to perform analysis. The Data Intake tier should, therefore, be architected to accommodate the wide latitude in storage requirements that these application needs imply.

The following figure depicts the three broad types of data that would be ingested and categorized by their timeliness:

The timeliness of Data

The Data Intake tier also contains the required processing that can "PULL" data from external sources and also consume the "PUSHED" data from external sources.

The data sources from which data can be "PULLED" by the Intake tier include the following:

  • Operational Data Stores (ODS)

  • Data Warehouses

  • Online Transaction Processing Systems (OLTP)

  • NoSQL systems

  • Mainframes

  • Audio

  • Video

Data sources that can "PUSH" data to the Intake tier include the following:

  • Clickstream and machine logs such as Apache common logs

  • Social media data from Twitter and so on

  • Sensor data such as temperature, body sensors (Fitbit), and so on

The Transient Zone

A Transient Landing Zone is a predefined, secured intermediate location where the data from various source systems is stored before it is moved into the Raw Zone. Generally, the Transient Landing Zone is file-based storage in which the data is organized by source system. Record counts and file-size checks are carried out on the data in this zone before it is moved into the Raw Zone.

In the absence of a Transient Zone, the data would have to go directly from the external sources to the Raw Zone, which could severely degrade the quality of data in the Raw Zone. The Transient Zone also offers a platform for carrying out minimal data validation checks. Let us explore the following capabilities of the Transient Zone:

  • A Transient Zone consolidates data from multiple sources, waits until a batch of data has "really" arrived, creates a basic aggregate by grouping together all the data from a single source, tags the data with metadata to indicate its source of origin, and generates timestamps and other relevant information.

  • It performs a basic validity check on the data that has just arrived and signals the retry mechanism to kick in if the integrity of the data is in question. MD5 checksums and record counts can be employed to facilitate this step.

  • It can even perform a high-level cleansing of data by removing/updating invalid data acquired from source systems (purely an optional step). It is a prime location for validating data quality from an external source for eventually auditing and tracking down data issues.

  • It can support data archiving. There are situations in which the freshly acquired data is deemed not-so-important and thus can be relegated to an archive directly from the Transient Zone.
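
The validity check mentioned in the preceding list (MD5 checksums and record counts, with a retry signal on failure) could look roughly like the following sketch. It assumes that the source system ships a manifest with the expected checksum and record count, and that records are newline-delimited; the `Manifest` type and the `validate_batch` function are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Manifest:
    """What the source system claims to have sent (assumed to accompany each batch)."""
    source: str
    expected_md5: str
    expected_records: int


def validate_batch(payload: bytes, manifest: Manifest) -> dict:
    """Compare the received bytes against the manifest and tag the batch."""
    actual_md5 = hashlib.md5(payload).hexdigest()
    actual_records = len(payload.splitlines())  # one record per line (assumption)
    intact = (
        actual_md5 == manifest.expected_md5
        and actual_records == manifest.expected_records
    )
    return {
        "source": manifest.source,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "md5": actual_md5,
        "records": actual_records,
        # "retry" tells the caller to ask the source system to resend the batch.
        "status": "accepted" if intact else "retry",
    }
```

MD5 is used here only to detect accidental corruption in transit, as the text suggests; it is not suitable as a defense against deliberate tampering.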

The following figure depicts the high-level functionality of the Transient Zone:

Transient Zone capabilities

The Raw Zone

The Raw Zone is where data lands from the Transient Zone. It is typically implemented as file-based storage (Hadoop HDFS). It includes a "Raw Data Storage" area to retain source data for active use and archival. This is the zone where we have to consider storage options based on the timeliness of the data.

Batch Raw Storage

Batch intake data is commonly pull-based, and we can leverage the power of HDFS to store massive amounts of it at a lower cost. The primary reason for the lower cost is that the data is stored on low-cost commodity disks. One of the key advantages of Hadoop is its inherent ability to store data without the need for it to comply with any structure at the time of ingestion; the data can be refined and structured as and when needed. This schema-on-read ability avoids the need for upfront data modeling and costly extract, transform, load (ETL) processing of data before it is stored in the Raw Zone. Parallel processing is leveraged to rapidly place this data into the Raw Zone. The following are the key functionalities of this zone:

  • This is the zone where data is deeply validated and watermarked for tracking and lineage lookup purposes.

  • Metadata about the source data is also captured at this stage. Any relevant security attributes that have a say in the access control of the data are also captured as metadata. This process will ensure that history is rapidly accessible, enabling the tracking of metadata to allow users to easily understand where the data was acquired from and what types of enrichments are applied as information moves through the Data Lake.

  • The Data Usage rights related to data governance are also captured and applied in this zone.
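
One way to picture the watermarking and metadata capture listed above is to wrap each record in an envelope that carries its origin, a content fingerprint, its usage rights, and a growing lineage trail. The envelope layout and the `watermark` and `record_step` functions below are assumptions made for this sketch, not a format prescribed by Hadoop.

```python
import hashlib
import json
from datetime import datetime, timezone


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def watermark(record: dict, source: str, batch_id: str, usage_rights: str) -> dict:
    """Wrap a raw record with origin, fingerprint, rights, and a lineage trail."""
    # Sorting keys makes the fingerprint independent of key order.
    body = json.dumps(record, sort_keys=True).encode("utf-8")
    return {
        "data": record,
        "_meta": {
            "source": source,
            "batch_id": batch_id,
            "fingerprint": hashlib.sha256(body).hexdigest(),
            "usage_rights": usage_rights,
            "lineage": [{"zone": "raw", "note": "ingested", "at": _now()}],
        },
    }


def record_step(envelope: dict, zone: str, note: str) -> dict:
    """Return a new envelope with one more lineage entry; the input is not mutated."""
    meta = dict(envelope["_meta"])
    meta["lineage"] = meta["lineage"] + [{"zone": zone, "note": note, "at": _now()}]
    return {"data": envelope["data"], "_meta": meta}
```

Later zones would call `record_step` whenever they touch the record, so that a consumer can read the `lineage` list to see where the data came from and which enrichments were applied.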

This zone enables reduced integration timeframes. In the traditional data warehouse model, information is consumed after it has been enriched, aggregated, and formatted to meet specific application needs. You can only consume the canned and aggregated data exhaust of a Data Warehouse. The Data Lake is architected differently to be modular, consisting of several distinct zones. These zones provide multiple consumption opportunities, resulting in flexibility for the consumer. Applications needing minimal enrichment can access data from a zone found early in the process workflow (such as the Raw Zone); bypassing "downstream" zones (such as the Data Hub Zone) reduces the cycle time to delivery. The time saved can be significant for consumers, such as data scientists, who need fast-paced delivery and minimal enrichment.
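
The schema-on-read behavior that makes this early access possible can be illustrated with plain Python. In the sketch below, the raw text is stored as-is and each consumer applies its own schema at read time; the sample data, the two "views", and the `read_with_schema` function are invented for the illustration.

```python
import csv
import io
from typing import Callable, Iterator, List, Tuple

# Raw data stored without any declared structure (sample values are made up).
RAW_TEXT = "1001,2015-11-02,49.90\n1002,2015-11-03,15.00\n"

Schema = List[Tuple[str, Callable[[str], object]]]


def read_with_schema(raw_text: str, schema: Schema) -> Iterator[dict]:
    """Apply column names and types only at read time."""
    for row in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter sequence, so a schema may cover fewer columns.
        yield {name: convert(value) for (name, convert), value in zip(schema, row)}


# Two consumers interpret the same raw bytes differently.
FINANCE_VIEW: Schema = [("order_id", int), ("order_date", str), ("amount", float)]
AUDIT_VIEW: Schema = [("order_id", str), ("order_date", str)]
```

Neither view requires the stored data to be remodeled or reloaded, which is the point of deferring the schema to the consumer.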

The real-time Raw Storage

In many applications, it is mandatory to consume data and react to stimuli in real time. For these applications, the latency of writing the data to disk in a file-based system such as HDFS introduces an unacceptable delay. Examples of these classes of applications, as discussed earlier, include GPS-aware mobile applications and applications that have to respond to events from sensors. An in-memory solution called GemFire can be used for real-time storage and for responding to events; it responds with low latency and stores the data at rest in HDFS.

The following figure illustrates the choices we make in the Raw Zone based on the type of data:

Raw Zone capabilities

The Data Management tier

In the preceding section, we discussed the ability of the Data Lake to intake and persist raw data as a precursor to prepare that data for migration to other zones of the Lake. In this section, we will see how that data moves from Raw to the Data Management tier in preparation for consumption and more sophisticated analytics.

The Management tier has three zones. The data flows sequentially from the Raw Zone through the Integration Zone and the Enrichment Zone; finally, after all the processes are complete, the data is stored in a ready-to-use format in the Data Hub, which is a combination of relational and NoSQL databases. The zones in the Management tier are as follows:

  • The Integration Zone

  • The Enrichment Zone

  • The Data Hub Zone

As the data moves into the Management tier, metadata is added and attached to each file. Metadata is a kind of watermark that tracks all changes made to each individual record. This tracking information, along with activity logging and quality monitoring, is stored in metadata that persists as the data moves through each of the zones. This information is extremely important for reporting on the progress of the data through the Lake, and it is used to expedite the investigation of anomalies and the corrections needed to ensure that quality information is delivered to the consuming applications. This metadata also helps in data discovery.

The Integration Zone

The Integration Zone's main functionality is to integrate data from various sources and apply common transformations that turn the raw data into a standardized, cleansed structure that is optimized for data consumers. This zone eventually paves the way for storing the data in the Data Hub Zone. The key functionalities of the Integration Zone are as follows:

  • Processes for automated data validation

  • Processes for data quality checks

  • Processes for integrity checks

  • Audit logging and reporting for the associated operational management
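
The functionalities in the preceding list can be sketched as a small validation pass that runs named checks on each record, standardizes the records that pass, and keeps an audit entry for every record either way. The check names, the field names, and the `integrate` function are hypothetical and only illustrate the shape of such a process.

```python
from typing import Callable, Dict, List, Tuple

Check = Tuple[str, Callable[[dict], bool]]

# Hypothetical quality and integrity checks.
CHECKS: List[Check] = [
    ("has_customer_id", lambda r: bool(r.get("customer_id"))),
    (
        "amount_non_negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]


def standardize(record: dict) -> dict:
    """Apply a common transformation: normalize the country code if present."""
    cleaned = dict(record)
    if isinstance(cleaned.get("country"), str):
        cleaned["country"] = cleaned["country"].strip().upper()
    return cleaned


def integrate(records: List[dict],
              checks: List[Check] = CHECKS) -> Tuple[List[dict], List[Dict]]:
    """Return the standardized records that passed, plus an audit log for all."""
    clean: List[dict] = []
    audit_log: List[Dict] = []
    for index, record in enumerate(records):
        failed = [name for name, check in checks if not check(record)]
        audit_log.append({"record": index, "failed_checks": failed})
        if not failed:
            clean.append(standardize(record))
    return clean, audit_log
```

Keeping an audit entry for passing records as well as failing ones makes it possible to report on everything that moved through the zone, not only on the rejects.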

Here is a visual representation of the key functionalities of the Integration Zone:

Integration Zone capabilities

The Enrichment Zone

The Enrichment Zone provides processes for data enhancement, augmentation, classification, and standardization. It includes processes for automated business rules' processing and processes to derive or append new attributes to the existing records from internal and external sources.
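
As a minimal sketch of rule-driven enrichment, the following code derives one attribute from the record itself and appends another from reference data. The `SEGMENTS` lookup table, the threshold of 100, and the rule functions are assumptions made for this illustration.

```python
from typing import Callable, Dict, List

# Hypothetical reference data, standing in for an internal or external source.
SEGMENTS: Dict[str, str] = {"C1": "gold", "C2": "silver"}


def add_segment(record: dict) -> dict:
    """Append an attribute looked up from reference data."""
    return {**record, "segment": SEGMENTS.get(record["customer_id"], "unknown")}


def add_order_size(record: dict) -> dict:
    """Derive a new attribute from an existing one using a business rule."""
    size = "large" if record["amount"] >= 100 else "small"
    return {**record, "order_size": size}


RULES: List[Callable[[dict], dict]] = [add_segment, add_order_size]


def enrich(record: dict, rules: List[Callable[[dict], dict]] = RULES) -> dict:
    """Run every rule in order; each rule returns a new, augmented record."""
    for rule in rules:
        record = rule(record)
    return record
```

Because each rule is an independent function, new business rules can be added to the `RULES` list without touching the existing ones.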

Integration and enrichment are performed on file-based HDFS storage rather than on a traditional relational data structure because file-based computing is advantageous: as the usage patterns of the data have not been determined yet, we have extreme flexibility within a file system. HDFS natively implements a schemaless storage system. The absence of a schema and indexes means that you do not need to preprocess the data before you can use it. As a result, the data loads faster and the structure is extensible, allowing it to flex as business needs change.

The following figure depicts the key functionalities of the Enrichment Zone:

The Enrichment Zone's capabilities

The Data Hub Zone

The Data Hub Zone is the final storage location for cleaned and processed data. After the data is transformed and enriched in the upstream zones, it is finally pushed into the Data Hub for consumption.

The Data Hub is governed by a discovery process that is internally implemented as search, locate, and retrieve functionality through tools such as Elasticsearch or Solr/Lucene. Discovery is made possible by the extensive metadata that has been collected in the previous zones.

The Data Hub stores relational data in common relational databases such as Oracle and Microsoft SQL Server. It stores non-relational data in NoSQL technologies (for example, HBase, Cassandra, MongoDB, and Neo4j).

The following figure depicts the capabilities of the Data Hub Zone:

Data Hub Zone capabilities

The Data Consumption tier

In the preceding section, we discussed the capability of the zones in the Data Lake to move data from the Raw Zone through the zones of the Data Management tier. In this section, we will discuss the ways in which data is packaged and provisioned for consumption and for more sophisticated analytics.

The Consumption tier is where the data is accessed, either in raw format from the Raw Zone or in structured format from the Data Hub. The data is provisioned through this tier for external access by analytics, visualization, or other applications through web services. The data is discovered through the data catalog published in the Consumption tier, and actual data access is governed by security controls that limit unwarranted access.

The Data Discovery Zone

The Data Discovery Zone is the primary gateway for external users into the Data Lake. The key to implementing a functional Consumption tier is the amount and quality of the metadata collected in the preceding zones and the intelligent way in which this metadata is exposed for search and data retrieval. Too much governance of the metadata might cause relevant search results to be missed, and too little governance could jeopardize the security and integrity of the data.

Data discovery also uses the data event logs, which are a part of the metadata, in order to query the data. All services that act on data in all the zones are logged along with their statuses so that the consumers of data can understand the complete lineage of how the data was impacted over time. Data event logging combined with metadata enables extensive data discovery and allows users to explore and analyze data. In summary, this zone provides a facility for data consumers to browse, search, and discover the data.

Data discovery provides an interface to search data using the metadata or the data content. This interface provides flexible, self-driven data discovery capabilities that enable the users to efficiently find and analyze relevant information.
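
The data event logging described in this section can be pictured as an append-only log that every service writes to and that consumers query by dataset. The `DataEvent` and `EventLog` types below are hypothetical; a real implementation would persist the events and expose them through the discovery interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataEvent:
    dataset: str
    zone: str
    service: str
    status: str  # for example "ok" or "failed"


class EventLog:
    """Toy append-only log of every service action on every dataset."""

    def __init__(self) -> None:
        self._events: List[DataEvent] = []

    def record(self, dataset: str, zone: str, service: str, status: str) -> None:
        self._events.append(DataEvent(dataset, zone, service, status))

    def lineage(self, dataset: str) -> List[DataEvent]:
        """Everything that happened to one dataset, in the order it happened."""
        return [event for event in self._events if event.dataset == dataset]

    def failures(self) -> List[DataEvent]:
        return [event for event in self._events if event.status != "ok"]
```

A consumer who finds a dataset through a metadata search can then call `lineage` to see which services acted on it in which zones before deciding whether to use it.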

The Data Provisioning Zone

Data Provisioning allows data consumers to source and consume the data that is available in the Data Lake. This zone is designed to let you use metadata that specifies the "publications" that need to be created, the "subscription"-specific customization requirements, and the end delivery of the requested data to the "data consumer." Data Provisioning can be done on all the data residing in the Data Lake; the data that is provisioned can be either in the Raw Zone or in the Data Hub Zone.
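
The publication and subscription idea can be sketched as follows. A publication names a dataset and the zone it comes from; a subscription carries the consumer-specific customization, reduced here to a column selection and an optional row filter. The `Publication`, `Subscription`, and `deliver` names are assumptions for this illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Publication:
    name: str
    zone: str            # "raw" or "data_hub"
    rows: List[dict]


@dataclass
class Subscription:
    consumer: str
    publication: str
    columns: List[str]
    row_filter: Optional[Callable[[dict], bool]] = None  # None means "all rows"


def deliver(subscription: Subscription,
            publications: Dict[str, Publication]) -> List[dict]:
    """Apply the subscriber's customization and return the rows to deliver."""
    publication = publications[subscription.publication]
    keep = subscription.row_filter or (lambda row: True)
    return [
        {column: row[column] for column in subscription.columns}
        for row in publication.rows
        if keep(row)
    ]
```

In this sketch, the same publication can serve several consumers, each of which receives only the columns and rows that its subscription asks for.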

The following figure depicts the important features of the Consumption tier:

Consumption Zone capabilities