Book Image

Modern Data Architectures with Python

By : Brian Lipp
3 (1)
Book Image

Modern Data Architectures with Python

3 (1)
By: Brian Lipp

Overview of this book

Modern Data Architectures with Python will teach you how to seamlessly incorporate your machine learning and data science work streams into your open data platforms. You’ll learn how to take your data and create open lakehouses that work with any technology using tried-and-true techniques, including the medallion architecture and Delta Lake. Starting with the fundamentals, this book will help you build pipelines on Databricks, an open data platform, using SQL and Python. You’ll gain an understanding of notebooks and applications written in Python using standard software engineering tools such as git, pre-commit, Jenkins, and Github. Next, you’ll delve into streaming and batch-based data processing using Apache Spark and Confluent Kafka. As you advance, you’ll learn how to deploy your resources using infrastructure as code and how to automate your workflows and code development. Since any data platform's ability to handle and work with AI and ML is a vital component, you’ll also explore the basics of ML and how to work with modern MLOps tooling. Finally, you’ll get hands-on experience with Apache Spark, one of the key data technologies in today’s market. By the end of this book, you’ll have amassed a wealth of practical and theoretical knowledge to build, manage, orchestrate, and architect your data ecosystems.
Table of Contents (19 chapters)
1
Part 1:Fundamental Data Knowledge
4
Part 2: Data Engineering Toolset
8
Part 3:Modernizing the Data Platform
13
Part 4:Hands-on Project

Lakehouse and Delta architectures

Introduced by Databricks, the lakehouse and Delta architectures are a significant improvement over previous data platform patterns. They are the next evolution by combining what works from previous modalities and improving them.

Lakehouses

What is a lakehouse? Why is it so important? It’s talked about often, but few people can explain the tenets of a lakehouse. The lakehouse is an evolution of the data warehouse and the data lake. A lakehouse takes the lessons learned from both and combines them to avoid the flaws of both. There are seven tenets of the lakehouse, and each is taken from one of the parent technologies.

The seven central tenets

Something that’s not always understood when engineers discuss lakehouses is that they're general sets of ideas. Here, we will walk through all seven essential tenets – openness, data diversity, workflow diversity, processing diversity, language-agnostic, decoupled storage and compute, and ACID transactions.

Openness

The openness principle is fundamental to everything in the lakehouse. It influences the preference for open standards over closed-source technology. This choice affects the long-term life of our data and the systems we choose to connect with. When you say a lakehouse is open, you are saying it uses nonproprietary technologies, but it also uses methodologies that allow for easier collaboration, such as decoupled storage and compute engines.

Data diversity

In a lakehouse, all data is welcome and accessible to users. Semi-structured data is given first-class citizenship alongside structured data, including schema enforcement.

Workflow diversity

With workflow diversity, users can access the data in many ways, including via notebooks, custom applications, and BI tools. How the user interacts with the data shouldn’t be limited.

Processing diversity

The lakehouse prioritizes both streaming and batch equally. Not only is streaming important but it also uses the Delta architecture to compress streaming and batch into one technology layer.

Language-agnostic

The goal of the lakehouse is to support all methods of accessing the data and all programming languages. However, this goal is not possible practically. When implemented, the list of methods and languages supported in Apache Spark is extensive.

Decoupling storage and compute

In a data warehouse, data storage is combined with the same technology. From a speed perspective, this is ideal, but it creates a lack of flexibility. When a new processing engine is desired, or a combination of many storage engines is required, the data warehouse’s model fails to perform. A unique characteristic taken from data lakes is decoupling the storage and compute layers, which creates several benefits. The first is flexibility; you can mix and match technologies. You can have data stored in graph databases, cloud data warehouses, object stores, or even systems such as Kafka. The second benefit is the significant cost reduction. This cost reduction comes when you incorporate technologies such as object stores. Cloud object stores such as AWS’s S3, Azure’s Blob, and GCP’s Object Storage represent cheap, effective, and massively scalable data storage. Lastly, when you follow this design pattern, you can scale at a more manageable rate.

ACID transactions

One significant issue with data lakes is the lack of transactional data processing. Transactions have substantial effects on the quality of your data. They affect anything from writing to the same table at the same time to a log of changes made to your table. With ACID transactions, lakehouses are significantly more reliable and effective compared to data lakes.

The medallion data pattern and the Delta architecture

The medallion data pattern is an approach to storing and serving data that is less based on building a data warehouse and more focused on building your raw data off a single source of truth.

Delta architecture

The following diagram shows the Delta architecture. It looks just like Kappa, but the key difference is that you are not forcing batch processing out of real-time data. Both exist within the same layers and systems:

Figure 1.6: Delta architecture

Figure 1.6: Delta architecture

The Delta architecture is a lessons-learned approach to both the Kappa and Lambda architectures. Delta sees that the complexity of trying to use real-time data as the sole data source can be highly complex and, in the majority of companies, overkill. Most companies want workloads in both batch and real-time, not excluding one over the other. Yet, Delta architectures reduce the footprint to one layer. This processing layer can handle batch and real-time with almost identical code. This is a huge step forward from previous architectural patterns.

The medallion data pattern

The medallion data pattern is an organized naming convention that explains the nature of the data being processed. When referencing your tables, labeling them with tags allows for clear visibility into tables.

The following diagram shows the medallion data pattern, which describes the state of each dataset at rest:

Figure 1.7: Medallion architecture

Figure 1.7: Medallion architecture

As you can see, the architecture has different types of tables (data). Let’s take a look.

Bronze

Bronze data is raw data that represents source data without modification, other than metadata. Bronze tables can be used as a single source of truth since the data is in its purest form. Bronze tables are incrementally loaded and can be a combination of streaming and batch data. It might be helpful to add metadata such as source data and processed timestamps.

Silver

Once you have bronze data, you will start to clean, mold, transform, and model your data. Silver data represents the final stage before creating data products. Once your data is in a silver table, it is considered validated, enriched, and ready for consumption.

Gold

Gold data represents your final data products. Your data is curated, summarized, and molded to meet your user’s needs in terms of BI dashboarding, statistical analysis, or machine learning modeling. Often, golden data is stored in separate buckets to allow for data scaling.

Data mesh theory and practice

Zhamak Dehghani created data mesh to overcome many common data platform issues while working for ThoughtWorks in 2019. I often get asked by seasoned data professionals, why bother with data mesh? Isn’t it just data silos all over again? The fundamentals of a data mesh are that it’s a decentralized data domain and scaling-focused but with those ideas also comes rethinking how we organize not only our data but also our whole data practice. By learning and adopting data mesh concepts and techniques, we cannot only produce more valuable data but also better enable users to access our data. No longer do we see orphaned data with little interaction from its creators. Users have direct relationships with data producers, and with that comes higher-quality data.

The following diagram shows the typical spaghetti data pipeline complexity that many growing organizations fall into. Imagine trying to maintain this maze of data pipelines. This is a common scenario for spaghetti data pipelines, which are very brittle and hard to maintain and scale:

Figure 1.8: Classic data pipeline architecture

Figure 1.8: Classic data pipeline architecture

The following diagram shows our data platform once it’s been decentralized, which means it’s free of brittle pipelines and able to scale:

Figure 1.9: Data mesh architecture

Figure 1.9: Data mesh architecture

Anyone who has worked on a more extensive organization’s data team can tell you how complex and challenging things get. You will find a maze of complex data pipelines everywhere. Pipelines are the blood of your data warehouse or data lake. As data is shipped and processed, it’s merged into a central warehouse. It’s common to have lots of data with very little visibility into who knows about the data and the quality of that data. The data sits there, and people who know about it use it in whatever state it lives in. So, let’s say you find that data. What exactly is the data? What is the path or lineage the data has taken to get to its current state? Who should you contact to correct issues with the data? So many questions, and often, the answers are less than ideal.

These were the types of problems Zhamak Dehghani was trying to tackle when she first came up with the idea of a data mesh. Dehghani noticed all the limitations of the current landscape in data organizations and developed a decentralized philosophy heavily influenced by Eric Evans’s domain-driven design. A data mesh is arguably a mix of organizational and technological changes. These changes, when adopted, allow your teams to have a better data experience. One thing I want to make very clear is that data mesh does not involve creating data silos. It involves creating an interconnected network of data that isn’t focused on the technical process but on the functionality concerning the data. Organizationally, a domain in data mesh will have a cross-functional team of data experts, software experts, data product owners, and infrastructure experts.

Defining terms

A data mesh has several terms that need to be explained for us to understand the philosophy fully. The first term that stands out is the data product owner. The data product owner is a member of the business team who takes on the role of the data steward or overseer of the data and is responsible for data governance. If there is an issue with data quality or privacy concerns, the data steward would be the person accountable for that data. Another term that’s often used in a data mesh is domain, which can be understood as an organizational group of commonly focused entities. Domains publish data for other domains to consume. Data products are the heart of the data mesh philosophy. The data products should be self-service data entities that are offered close to creators. What does it mean to have self-service data? Your data is self-service when other domains can search, find, and access your data without having to have any administrative steps. This should all live on a data platform, which isn’t one specific technology but a cohesive network of technologies.

The four principles of data mesh

Let’s look at these four principles – that is, data ownership, data as a product, data availability, and data governance.

Data ownership

Data ownership is a fundamental concept in a data mesh. Data ownership is a partnership between the cross-functional teams within a domain and other domains that are using the data products in downstream apps and data products. In the traditional model, data producers allow a central group of engineers to send their data to a single repository for consumption. This created a scenario where data warehouse engineers were the responsible parties for data. These engineers tried to be the single source of truth when it came to this data. The source is now intimately involved with the data consumption process. This reduces and improves the data quality and removes the need for a central engineering group to manage data. To accomplish this task, teams within a domain have a wide range of skill sets, including software engineers, product owners, DevOps engineers, data engineers, and analytics engineers. So, who ultimately owns the data? It’s very simple – if you produce the data, you own it and are responsible for it.

Data as a product

Data as a product is a fundamental concept that transforms the data that is offered to users. When each domain treats the data that others consume as an essential product, it reinvents our data. We market our data and have a vested interest in that data. Consumers want to know all about the data before they buy our product. So, when we advertise our data, we list the characteristics users need to make that educated decision. These characteristics are things such as lineage, access statistics, data contract, the data product owners, privacy level, and how to access the data. Each product is a uniquely curated creation and versioned to give complete visibility to consumers. Each version of our data product represents a new data contract with downstream users. What is a data contract? It’s an agreement between the producers and consumers of the data. Not only is the data expected to be clean and kept in high quality but the schema of the data is also guaranteed. When a new version comes out, any schema changes must be backward-compatible to avoid breaking changes. This is called schema evolution and is a cornerstone to developing a trusted data product.

Data is available

As a consumer of data in an organization, I should be able to find data easily in some type of registry. The data producers should have accurate metadata about the data products within this registry. When I want to access this data, there should be an automated process that is ideally role-based. In an ideal world, this level of self-service exists to create an ecosystem of data. When we build data systems with this level of availability, we see our data practice grow, and we evolve our data usage.

Data governance

Data governance is a very loaded term with various meanings, depending on the person and the context. In the context of a data mesh, data governance is applied within each domain. Each domain will guarantee quality data that fulfills the data contract, all data meets the privacy level advertised, and appropriate access is granted or revoked based on company policies. This concept is often called federated data governance. In a federated model, governance standards are centrally defined but executed within each domain. Like any other area of a data mesh, the infrastructure can be shared but in a distributed manner. This distributed approach allows for standards across the organization but only via domain-specific implementations.