Principles of Data Fabric

By: Sonia Mezzetta

Overview of this book

Data can be found everywhere, from cloud environments and relational and non-relational databases to data lakes, data warehouses, and data lakehouses. Data management practices can be standardized across the cloud, on-premises, and edge devices with Data Fabric, a powerful architecture that creates a unified view of data. This book will enable you to design a Data Fabric solution by addressing all the key aspects that need to be considered. The book begins by introducing you to Data Fabric architecture, why you need it, and how it relates to other strategic data management frameworks. You’ll then quickly progress to grasping the principles of DataOps, an operational model for Data Fabric architecture. The next set of chapters will show you how to combine Data Fabric with DataOps and Data Mesh and how to make the most of them working together. After that, you’ll discover how to design Data Integration, Data Governance, and Self-Service analytics architecture. The book ends with technical architecture to implement distributed data management and regulatory compliance, followed by industry best practices and principles. By the end of this book, you will have a clear understanding of what Data Fabric is and what the architecture looks like, along with the level of effort that goes into designing a Data Fabric solution.
Table of Contents (16 chapters)
Part 1: The Building Blocks
Part 2: Complementary Data Management Approaches and Strategies
Part 3: Designing and Realizing Data Fabric Architecture

Why is Data Fabric important?

Data Fabric enables businesses to leverage the power of connected, trusted, protected, and secure data no matter where it’s geographically located or stored (cloud, multi-cloud, hybrid cloud, on-premises, or the edge). Data Fabric handles the diversity of data, use cases, and technologies to create a holistic end-to-end picture of data with actionable insights. It addresses the shortcomings of previous data management solutions while considering lessons learned and building on industry best practices. Data Fabric’s approach is based on a common denominator, metadata. Metadata is the secret sauce of Data Fabric architecture, along with automation enabled by machine learning and artificial intelligence (AI), deep Data Governance, and knowledge management. All these aspects lead to the efficient and effective management of data to achieve business outcomes, therefore cutting down on operational costs and increasing profit margins through strategic decision-making.

Some of the key benefits of Data Fabric are as follows:

  • It addresses data silos with actionable insights from a connected view of disparate data across environments (cloud, multi-cloud, hybrid cloud, on-premises, or the edge) and geographies
  • Data democratization leads to a shorter time to business value with frictionless Self-Service data access
  • It establishes trusted, secure, and reliable data via automated Data Governance and knowledge management
  • It enables a business user with intuitive discovery, understanding, and access to data while addressing a technical user’s needs by supporting various data processing techniques, whether batch or real time, including ETL/ELT, data virtualization, change data capture, and streaming

Now that we have a view of why Data Fabric is important and how it takes a modern approach to data management, let’s review some of the drawbacks of earlier data management approaches.

Drawbacks of centralized data management

Data is spread everywhere: on-premises, across cloud environments, and on different types of databases, such as SQL, NoSQL, data lakes, data warehouses, and data lakehouses. Many of the challenges associated with this in the past decade, such as data silos, still exist today. The traditional data management approach to analytics is to move data into a centralized data storage system. Moving data into one central system facilitates control and decreases the necessary checkpoints across the large number of different environments and data systems. Thinking about this logically, it makes total sense. If you think about everyday life, we are successful at controlling and containing things if they are in one central place.

As an example, consider the shipment of goods from a warehouse to a store that requires inspection during delivery. Inspecting the shipment of goods in one store will require a smaller number of people and resources as opposed to accomplishing this for 100 stores located across different locations. Seamless management and quality control become a lot harder to achieve across the board. The same applies to data management, and this is what led to the solution of centralized data management.

While centralized data management was the de facto approach for decades and is still used today, it has several shortcomings. Data movement and integration come at a high cost, especially when dealing with on-premises data storage solutions. Centralized data management relies heavily on data duplication to satisfy a diverse set of use cases requiring different contexts. Complex and performance-intensive data pipelines built to enable data movement require intricate maintenance and significant infrastructure investments, especially if automation or governance is nowhere in the picture. In a traditional operating model, IT departments centrally manage technical platforms for business domains. In the past and still today, this model creates bottlenecks in the delivery of and access to data, lengthening the time to value.

Enterprise data warehouses

Enterprise data warehouses are complex systems that require consensus across business domains on common definitions of data. An enterprise data model is tightly coupled to data assets. Any change to the physical data model without proper dependency management breaks downstream consumption. There are also challenges in Data Quality, such as data duplication and the lack of business skills to manage data within the technical platform team.
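To make the dependency problem concrete, here is a minimal Python sketch of the kind of check that dependency management implies: before a column is dropped or renamed, look up which downstream consumers read it. The Consumer class, the impacted_consumers function, and all table, column, and consumer names are hypothetical and exist only for illustration; they do not come from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consumer:
    """A downstream report, model, or pipeline (hypothetical example type)."""
    name: str
    columns: frozenset  # "table.column" identifiers the consumer reads


def impacted_consumers(removed_columns, consumers):
    """Return the sorted names of consumers that read at least one removed column."""
    removed = set(removed_columns)
    return sorted(c.name for c in consumers if c.columns & removed)


# Illustrative lineage information; in practice this would come from metadata.
consumers = [
    Consumer("sales_dashboard", frozenset({"orders.amount", "orders.region"})),
    Consumer("churn_model", frozenset({"customers.tenure", "orders.amount"})),
    Consumer("hr_report", frozenset({"employees.dept"})),
]

broken = impacted_consumers(["orders.amount"], consumers)
print(broken)  # ['churn_model', 'sales_dashboard']
```

Without a record of which consumers read which columns, the same change would only surface as failures after deployment.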

Data lakes

Data lakes came after data warehouses to offer a flexible way of loading data quickly without the restrictions of upfront data modeling. Data lakes can load raw data as is and later worry about its transformation and proper data modeling. Data lakes are typically managed in NoSQL databases or file-based distributed storage such as Hadoop. Data lakes support semi-structured and unstructured data in addition to structured data. Challenges with data lakes come from the very fact that they bypass the need to model data upfront, therefore creating unusable data without any proper business context. Such data lakes have been referred to as data swamps, where the data stored has no business value.
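The trade-off described above can be sketched in a few lines of Python. Raw records are stored as is, and a schema is applied only when the data is read; records that do not fit the expected shape are unusable at that point. The sample records, the SCHEMA mapping, and the read_with_schema function are assumptions made for this illustration, not part of any specific data lake technology.

```python
import json

# Raw records landed in the lake as is, with no upfront modeling (illustrative data).
RAW_LINES = [
    '{"id": 1, "amount": "19.90", "currency": "EUR"}',
    '{"id": 2, "amt": 5}',
    '{"id": 3, "amount": "7.25", "currency": "USD"}',
]

# Schema applied at read time: field name -> function that casts the raw value.
SCHEMA = {"id": int, "amount": float, "currency": str}


def read_with_schema(lines, schema):
    """Split raw JSON lines into records that fit the schema and records that do not."""
    usable, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            usable.append({name: cast(record[name]) for name, cast in schema.items()})
        except (KeyError, TypeError, ValueError):
            # Missing or malformed fields: the record has no usable context.
            rejected.append(record)
    return usable, rejected


usable, rejected = read_with_schema(RAW_LINES, SCHEMA)
print(len(usable), "usable,", len(rejected), "rejected")
```

Here the second record was loaded without complaint but cannot be interpreted later, which is a small-scale version of how a data swamp forms.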

Data lakehouses

Data lakehouses are a newer technology that combines data warehouse and data lake design. Data lakehouses support structured, semi-structured, and unstructured data and are capable of addressing data science and business intelligence use cases.

Decentralized data management

While there are several great capabilities in centralized data systems, such as data warehouses, data lakes, and data lakehouses, the reality is that all these systems now have a role, and their coexistence creates the need for decentralized data management. A single centralized data management system is not equipped to handle all possible use cases in an organization and at the same time excel in proper data management. I’m not saying there is no need for a centralized data system, but rather, it can represent a progression. For example, a small company might start with one centralized system that fits its business needs, and as it grows, it evolves into more decentralized data management.

Another example is a business domain within a large company that might own and manage a data lake, or a data lakehouse that needs to co-exist with several other data systems owned by other business domains. This again represents decentralized data management. Cloud technologies have further provoked the proliferation of data. There is a multitude of cloud providers with their own set of capabilities and cost incentives, leading to organizations having multi-cloud and hybrid cloud environments.

We have evolved from a world of centralized data management as the best practice to a world in which decentralized data management is necessary. There is a seat at the table for all types of centralized systems. What’s important is for these systems to have a data architecture that connects data in an intelligent and cohesive manner. This means a data architecture with the right level of control and rigor while balancing quick access to trusted data, which is where Data Fabric architecture plays a major role.

In the next section, let’s briefly discuss considerations in building Data Fabric architecture.

Building Data Fabric architecture

Building Data Fabric architecture is not an easy undertaking. It’s not a matter of building a simple 1-2-3 application or applying specific technologies. It requires collaboration, business alignment, and strategic thinking about the design of the data architecture; the careful evaluation and selection of different tools, data storage systems, and technologies; and thought into when to buy or build. Metadata is the common thread that ties data together in a Data Fabric design. Metadata must be embedded into every aspect of the life cycle of data from start to finish. Data Fabric actively manages metadata, which enables scalability and automation and creates a design that can handle the growing demands of businesses. It offers a future-proof design that can grow to add subsequent tools and technologies.
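As a rough illustration of metadata acting as the common thread, the following Python sketch registers datasets that live in different environments in one catalog and finds them by a shared business term. The DatasetMetadata and Catalog classes, their fields, and the dataset names are hypothetical and greatly simplified; a real Data Fabric manages far richer metadata and does so actively and automatically.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Descriptive metadata for one dataset (hypothetical, simplified)."""
    name: str
    location: str  # for example "on-premises", "cloud", or "edge"
    owner: str
    business_terms: set = field(default_factory=set)


class Catalog:
    """A toy metadata catalog spanning datasets in different environments."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Add or update the metadata entry for a dataset."""
        self._entries[entry.name] = entry

    def find_by_term(self, term):
        """Return the sorted names of datasets tagged with the given business term."""
        return sorted(
            e.name for e in self._entries.values() if term in e.business_terms
        )


catalog = Catalog()
catalog.register(DatasetMetadata("crm_customers", "on-premises", "sales", {"customer"}))
catalog.register(DatasetMetadata("web_clicks", "cloud", "marketing", {"customer", "session"}))
catalog.register(DatasetMetadata("sensor_readings", "edge", "operations", {"device"}))

print(catalog.find_by_term("customer"))  # ['crm_customers', 'web_clicks']
```

The data itself stays where it is; only the metadata is brought together, which is what lets a connected view span on-premises, cloud, and edge systems.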

Now, with this in mind, let’s introduce a bird’s-eye view of a Data Fabric design by discussing its building blocks.