Data Ingestion with Python Cookbook

By: Gláucia Esppenchutz

Overview of this book

Data Ingestion with Python Cookbook offers a practical approach to designing and implementing data ingestion pipelines. It presents real-world examples with the most widely recognized open source tools on the market to answer commonly asked questions and overcome challenges. You’ll be introduced to designing and working with or without data schemas, as well as creating monitored pipelines with Airflow and data observability principles, all while following industry best practices. The book also addresses challenges associated with reading different data sources and data formats. As you progress through the book, you’ll gain a broader understanding of error logging best practices, troubleshooting techniques, data orchestration, monitoring, and storing logs for further consultation. By the end of the book, you’ll have a fully automated set that enables you to start ingesting and monitoring your data pipeline effortlessly, facilitating seamless integration with subsequent stages of the ETL process.

Implementing data replication

Data replication is a process applied in data environments to create multiple copies of data and store them in different locations, servers, or sites. This technique is commonly implemented to improve availability and to avoid data loss during downtime, or even during a natural disaster that affects a data center.

Getting ready

Across papers and articles, you will find different types of data replication (and even different names for the same type), along with differing advice on how to choose among them. In this recipe, you will learn how to decide which kind of replication best suits your application or software.

How to do it…

Let’s begin to build our fundamental pillars to implement data replication:

  1. First, we need to decide the extent of our replication: it can cover all the stored data or only a portion of it.
  2. The next step is to consider when replication will take place. It can happen synchronously, as soon as new data arrives in storage, or on a schedule, within a specific timeframe.
  3. The last fundamental pillar is whether the data is replicated incrementally or in bulk.

In the end, we will have a diagram that looks like the following:

Figure 1.21 – A data replication model decision diagram


How it works…

Analyzing the preceding figure, we have three main questions to answer, regarding the extent, the frequency, and whether our replication will be incremental or bulk.
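As a minimal sketch, the three questions can be captured as a small configuration object. The class and member names below (Extent, Frequency, TransferMode, ReplicationPlan) are hypothetical and chosen only for illustration; they do not come from any library or from the book's code.

```python
from dataclasses import dataclass
from enum import Enum


class Extent(Enum):
    """How much of the stored data is replicated."""
    FULL = "full"
    PARTIAL = "partial"


class Frequency(Enum):
    """When replication takes place."""
    REAL_TIME = "real_time"
    SCHEDULED = "scheduled"


class TransferMode(Enum):
    """How data is transported to the replication site."""
    INCREMENTAL = "incremental"
    BULK = "bulk"


@dataclass(frozen=True)
class ReplicationPlan:
    """Hypothetical container for the three replication decisions."""
    extent: Extent
    frequency: Frequency
    transfer_mode: TransferMode

    def describe(self) -> str:
        # Join the three answers into one human-readable summary.
        return " / ".join(
            (self.extent.value, self.frequency.value, self.transfer_mode.value)
        )


plan = ReplicationPlan(Extent.PARTIAL, Frequency.SCHEDULED, TransferMode.INCREMENTAL)
print(plan.describe())
```

Writing the answers down explicitly like this makes it harder to leave one of the three decisions implicit when a pipeline is designed.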

For the first question, we decide whether the replication will be complete or partial. In other words, either all the data will be replicated, no matter what type of transaction or change was made, or just a portion of the data will be replicated. A real-world example would be keeping track of all store sales versus just the most expensive ones.
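The store sales example can be sketched with plain Python lists. The replicate function, the record layout, and the 1000 threshold are assumptions made for illustration only; a real pipeline would read from and write to actual storage.

```python
from typing import Callable, Optional


def replicate(
    records: list[dict],
    keep: Optional[Callable[[dict], bool]] = None,
) -> list[dict]:
    """Return copies of the records that should reach the replica.

    With keep=None every record is copied (complete replication);
    otherwise only records matching the predicate are copied (partial).
    """
    if keep is None:
        return [dict(record) for record in records]
    return [dict(record) for record in records if keep(record)]


# Hypothetical sales data.
sales = [
    {"id": 1, "amount": 25.0},
    {"id": 2, "amount": 1200.0},
    {"id": 3, "amount": 310.5},
]

full_copy = replicate(sales)  # complete replication
expensive_only = replicate(sales, keep=lambda s: s["amount"] >= 1000)  # partial

print(len(full_copy), len(expensive_only))
```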

The second question, related to frequency, is when replication needs to happen. The answer also has to take the related costs into account. Real-time replication is often more expensive, but its synchronicity guarantees almost no data inconsistency between the source and the replica.
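The trade-off can be illustrated with two toy in-memory stores. This is a sketch under simplifying assumptions: both classes are hypothetical, dictionaries stand in for real storage, and network failures are ignored.

```python
class SyncReplicatedStore:
    """Copies every write to the replica in the same call."""

    def __init__(self) -> None:
        self.primary: dict = {}
        self.replica: dict = {}

    def write(self, key, value) -> None:
        self.primary[key] = value
        # The replica is updated before write() returns.
        self.replica[key] = value


class ScheduledReplicatedStore:
    """Copies data to the replica only when run_replication() is called."""

    def __init__(self) -> None:
        self.primary: dict = {}
        self.replica: dict = {}

    def write(self, key, value) -> None:
        # Cheaper write path: only the primary is touched.
        self.primary[key] = value

    def run_replication(self) -> None:
        # A scheduler (cron, Airflow, and so on) would trigger this step.
        self.replica = dict(self.primary)


sync_store = SyncReplicatedStore()
sync_store.write("order-1", 99.9)

scheduled_store = ScheduledReplicatedStore()
scheduled_store.write("order-1", 99.9)
lag_before_run = scheduled_store.primary != scheduled_store.replica
scheduled_store.run_replication()
lag_after_run = scheduled_store.primary != scheduled_store.replica

print(lag_before_run, lag_after_run)
```

In the scheduled variant, the replica lags behind the primary until the next run, which is the inconsistency window that real-time replication pays to avoid.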

Lastly, it is relevant to consider how data will be transported to the replication site. In most cases, a scheduler with a script can replicate small data batches incrementally and reduce transfer costs. However, bulk replication can also be used in the data ingestion process, such as when copying all the current batch's raw data from a source to cold storage.
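The difference between the two transport styles can be sketched as follows. The example assumes append-only records that carry an updated_at timestamp, which is an assumption of this sketch rather than something the recipe prescribes; the function names are hypothetical.

```python
from datetime import datetime


def bulk_replicate(source: list[dict]) -> list[dict]:
    """Copy everything currently in the source in one pass."""
    return list(source)


def incremental_replicate(
    source: list[dict], replica: list[dict], last_synced: datetime
) -> datetime:
    """Copy only rows newer than the last sync and return the new watermark."""
    new_rows = [row for row in source if row["updated_at"] > last_synced]
    replica.extend(new_rows)
    # Keep the old watermark when nothing new arrived.
    return max((row["updated_at"] for row in new_rows), default=last_synced)


# Hypothetical source data.
source = [
    {"id": 1, "updated_at": datetime(2023, 1, 1)},
    {"id": 2, "updated_at": datetime(2023, 1, 2)},
]
replica: list[dict] = []

# First scheduled run copies both rows.
watermark = incremental_replicate(source, replica, datetime.min)

# A new row arrives; the next run copies only that row.
source.append({"id": 3, "updated_at": datetime(2023, 1, 3)})
watermark = incremental_replicate(source, replica, watermark)

cold_copy = bulk_replicate(source)
print(len(replica), len(cold_copy), watermark)
```

The incremental path moves only what changed since the previous run, while the bulk path copies the whole batch each time. Note that this sketch does not handle updates to rows that were already replicated.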

There’s more…

One method of data replication that has seen an increase in use in the past few years is cold storage, which is used to retain data that is used infrequently or is even inactive. The costs related to this type of replication are low, and it helps ensure data longevity. You can find cold storage solutions from the major cloud providers, such as Amazon Glacier, Azure Cool Blob, and Google Cloud Storage Nearline.

Besides replication, regulatory compliance, such as with the General Data Protection Regulation (GDPR), can benefit from this type of storage, since, in some scenarios, users' data needs to be kept for several years.

In this chapter, we explored the basic concepts and laid the foundation for the following chapters and recipes in this book. We started with a Python installation, prepared our Docker containers, and saw data governance and replication concepts. Over the upcoming chapters, you will observe that almost all of these topics interconnect, and you will see why it is important to understand them at the beginning of the ETL process.