
Applying data governance in ingestion

Data governance is a set of methodologies that ensure that data is secure, available, well-stored, documented, private, and accurate.

Getting ready

Data ingestion is the beginning of the data pipeline process, but that doesn’t mean data governance is not heavily applied to it. The governance status of the final pipeline output depends on how governance was implemented during ingestion.

The following diagram shows how data ingestion is commonly conducted:

Figure 1.18 – The data ingestion process

Let’s analyze the steps in the diagram:

  1. Getting data from the source: The first step is to define the type of data, its periodicity, where we will gather it from, and why we need it.
  2. Writing the scripts to ingest data: Based on the answers from the previous step, we can begin planning how our code will behave and outline some basic steps (see the sketch after this list).
  3. Storing data in a temporary database or other types of storage: Between the ingestion and transformation phases, data is typically stored in a temporary database or repository.
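
To make these steps concrete, here is a minimal Python sketch of the flow, assuming a JSON API as the source and a local SQLite database as the staging area; the URL, table name, and file path are hypothetical placeholders rather than a prescribed setup:

    import json
    import sqlite3
    from datetime import datetime, timezone
    from urllib.request import urlopen

    # Hypothetical source endpoint and staging database, used only for illustration.
    SOURCE_URL = "https://example.com/api/orders"
    STAGING_DB = "staging.db"

    def fetch_from_source(url: str) -> list:
        """Step 1: get data from the source (here, a JSON API polled on a schedule)."""
        with urlopen(url) as response:
            return json.load(response)

    def store_in_staging(records: list, db_path: str) -> None:
        """Step 3: keep raw records in temporary storage until transformation."""
        with sqlite3.connect(db_path) as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS raw_records (ingested_at TEXT, payload TEXT)"
            )
            ingested_at = datetime.now(timezone.utc).isoformat()
            conn.executemany(
                "INSERT INTO raw_records VALUES (?, ?)",
                [(ingested_at, json.dumps(record)) for record in records],
            )

    if __name__ == "__main__":
        # Step 2: the ingestion script ties the source to the staging area.
        store_in_staging(fetch_from_source(SOURCE_URL), STAGING_DB)

In a real pipeline, these fetching and storing steps would typically be scheduled and monitored by an orchestrator such as Airflow rather than run by hand.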
The following diagram shows the pillars that support data governance:

Figure 1.19 – Data governance pillars

How to do it…

Step by step, let’s map the pillars in Figure 1.19 onto the ingestion phase:

  1. Accessibility concerns need to be applied at the data source level, defining who is allowed to see or retrieve the data.
  2. Next, it is necessary to catalog our data to understand it better. Since we are only covering data ingestion here, it is most relevant to catalog the data sources.
  3. The quality pillar is applied to the ingestion and staging areas, where we control the data and keep its quality aligned with the source.
  4. Then, let’s define ownership. We know the data source belongs to a business area or a company. However, once we ingest the data and put it in temporary or staging storage, it becomes our responsibility to maintain it.
  5. The last pillar involves keeping data secure throughout the whole pipeline. Security is vital at every step, since we may be handling private or sensitive information (a rough sketch of these pillars in code follows this list).
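
As a rough illustration of how these pillars can surface in ingestion code, the following sketch attaches governance metadata to a single hypothetical ingestion job; the field names, roles, and thresholds are assumptions made for the example, not a prescribed schema or library API:

    import hashlib

    # Hypothetical catalog entry describing one ingestion job (illustrative only).
    CATALOG_ENTRY = {
        "source": "orders_api",                          # pillar 2: catalog the data source
        "owner": "sales-data-team",                      # pillar 4: ownership of the staged copy
        "allowed_roles": {"data_engineer", "analyst"},   # pillar 1: accessibility
        "expected_daily_rows": 100_000,                  # pillar 3: a simple quality expectation
    }

    def can_read(role: str) -> bool:
        """Pillar 1: restrict who may see or retrieve data from the source."""
        return role in CATALOG_ENTRY["allowed_roles"]

    def quality_ok(row_count: int) -> bool:
        """Pillar 3: flag runs that fall far below the expected volume."""
        return row_count >= 0.5 * CATALOG_ENTRY["expected_daily_rows"]

    def mask_sensitive(value: str) -> str:
        """Pillar 5: hash sensitive fields before they reach staging."""
        return hashlib.sha256(value.encode()).hexdigest()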
Figure 1.20 – Adding to data ingestion

How it works…

While some articles define “pillars” as a way to establish good governance practices, the best way to understand how to apply them is to understand what they are composed of. As you saw in the previous How to do it… section, we attributed some items to our pipeline, and now we can see how they are connected to the following topics:

  • Data accessibility: Data accessibility refers to how people from a group, organization, or project can see and use data. The information needs to be readily available for use, but only to the people involved in the process; for example, access to sensitive data should be restricted to specific people or programs. In the diagram we built, we applied this pillar to our data sources, since we need to understand and retrieve data. For the same reason, it can be applied to temporary storage as well.
  • Data catalog: Cataloging and documenting data are essential for business and engineering teams. When we know what types of information reside in our databases or data lakes and have quick access to that documentation, the time needed to solve a problem becomes shorter.

Again, documenting our data sources makes the ingestion process quicker, since without documentation we would need to run a discovery process every time we ingest data.

  • Data quality: Quality is a constant concern when ingesting, processing, and loading data. It is essential to track and monitor the expected volume of incoming and outgoing data based on its periodicity. For example, if we expect to ingest 300 GB of data per day and the volume suddenly drops to 1 GB, something is very wrong and will affect the quality of our final output (see the sketch after this list). Other quality parameters can be the number of columns, partitioning, and so on, which we will explore later in this book.
  • Ownership: Who is responsible for the data? This definition is crucial for contacting the owner when problems arise and for attributing responsibility for keeping and maintaining the data.
  • Security: Data security is a pressing topic nowadays. With so many regulations about data privacy, it has become an obligation for data engineers and scientists to know at least the basics of encryption, sensitive data handling, and how to avoid data leaks. Even the languages and libraries used for work need to be evaluated. That’s why this item is attributed to the three steps in Figure 1.19.
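
To ground the quality point with the 300 GB example above, here is a minimal sketch of a daily volume check; the expected size, tolerance, and alerting mechanism are assumptions for illustration, and in practice this check would usually feed a monitoring or orchestration tool:

    # Assumed expectations for the example: ~300 GB per day, alert below 10% of that.
    EXPECTED_DAILY_BYTES = 300 * 1024**3
    MIN_RATIO = 0.1

    def volume_looks_healthy(ingested_bytes: int) -> bool:
        """Return False when today's ingested volume is suspiciously small."""
        return ingested_bytes >= EXPECTED_DAILY_BYTES * MIN_RATIO

    if __name__ == "__main__":
        ingested_today = 1 * 1024**3  # only ~1 GB arrived today
        if not volume_looks_healthy(ingested_today):
            print("Quality alert: ingested volume is far below the expected 300 GB")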

In addition to the topics we explored, a broader data governance project includes a vital role called the data steward, who is responsible for managing an organization’s data assets and ensuring that data is accurate, consistent, and secure. In short, data stewardship is the practice of managing and overseeing an organization’s data assets.

See also

You can read more about a recent vulnerability found in one of the most used tools for data engineering here: https://www.ncsc.gov.uk/information/log4j-vulnerability-what-everyone-needs-to-know.