
Applying data governance in ingestion

Data governance is a set of methodologies that ensure that data is secure, available, well-stored, documented, private, and accurate.

Getting ready

Data ingestion is the beginning of the data pipeline process, but that doesn’t mean data governance is not heavily applied to it. The governance status of the final pipeline output depends on how governance was implemented during ingestion.

The following diagram shows how data ingestion is commonly conducted:

Figure 1.18 – The data ingestion process

Let’s analyze the steps in the diagram:

  1. Getting data from the source: The first step is to define the type of data, its periodicity, where we will gather it from, and why we need it.
  2. Writing the scripts to ingest data: Based on the answers from the previous step, we can begin planning how our code will behave and outline some basic steps (see the sketch after this list).
  3. Storing data in a temporary database or other types of storage: Between the ingestion and transformation phases, data is typically stored in a temporary database or repository.
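
To make these steps concrete, here is a minimal Python sketch of the flow, assuming a JSON API as the source and a local SQLite database as the staging area; the URL, table name, and file path are hypothetical placeholders rather than a prescribed setup:

    import json
    import sqlite3
    from datetime import datetime, timezone
    from urllib.request import urlopen

    # Hypothetical source endpoint and staging database, used only for illustration.
    SOURCE_URL = "https://example.com/api/orders"
    STAGING_DB = "staging.db"

    def fetch_from_source(url: str) -> list:
        """Step 1: get data from the source (here, a JSON API polled on a schedule)."""
        with urlopen(url) as response:
            return json.load(response)

    def store_in_staging(records: list, db_path: str) -> None:
        """Step 3: keep raw records in temporary storage until transformation."""
        with sqlite3.connect(db_path) as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS raw_records (ingested_at TEXT, payload TEXT)"
            )
            ingested_at = datetime.now(timezone.utc).isoformat()
            conn.executemany(
                "INSERT INTO raw_records VALUES (?, ?)",
                [(ingested_at, json.dumps(record)) for record in records],
            )

    if __name__ == "__main__":
        # Step 2: the ingestion script ties the source to the staging area.
        store_in_staging(fetch_from_source(SOURCE_URL), STAGING_DB)

In a real pipeline, these fetching and storing steps would typically be scheduled and monitored by an orchestrator such as Airflow rather than run by hand.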
The following diagram shows the pillars that support data governance:

Figure 1.19 – Data governance pillars

How to do it…

Step by step, let’s map the pillars in Figure 1.19 onto the ingestion phase:

  1. Accessibility concerns need to be applied at the data source level, defining who is allowed to see or retrieve the data.
  2. Next, it is necessary to catalog our data to understand it better. Since we are only covering data ingestion here, it is most relevant to catalog the data sources.
  3. The quality pillar is applied to the ingestion and staging areas, where we control the data and keep its quality aligned with the source.
  4. Then, let’s define ownership. We know the data source belongs to a business area or a company. However, once we ingest the data and put it in temporary or staging storage, it becomes our responsibility to maintain it.
  5. The last pillar involves keeping data secure throughout the whole pipeline. Security is vital at every step, since we may be handling private or sensitive information (a rough sketch of these pillars in code follows this list).
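
As a rough illustration of how these pillars can surface in ingestion code, the following sketch attaches governance metadata to a single hypothetical ingestion job; the field names, roles, and thresholds are assumptions made for the example, not a prescribed schema or library API:

    import hashlib

    # Hypothetical catalog entry describing one ingestion job (illustrative only).
    CATALOG_ENTRY = {
        "source": "orders_api",                          # pillar 2: catalog the data source
        "owner": "sales-data-team",                      # pillar 4: ownership of the staged copy
        "allowed_roles": {"data_engineer", "analyst"},   # pillar 1: accessibility
        "expected_daily_rows": 100_000,                  # pillar 3: a simple quality expectation
    }

    def can_read(role: str) -> bool:
        """Pillar 1: restrict who may see or retrieve data from the source."""
        return role in CATALOG_ENTRY["allowed_roles"]

    def quality_ok(row_count: int) -> bool:
        """Pillar 3: flag runs that fall far below the expected volume."""
        return row_count >= 0.5 * CATALOG_ENTRY["expected_daily_rows"]

    def mask_sensitive(value: str) -> str:
        """Pillar 5: hash sensitive fields before they reach staging."""
        return hashlib.sha256(value.encode()).hexdigest()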
Figure 1.20 – Adding to data ingestion

How it works…

While some articles define “pillars” as a way to establish good governance practices, the best way to understand how to apply them is to understand what they are composed of. As you saw in the previous How to do it… section, we attributed some items to our pipeline, and now we can see how they are connected to the following topics:

  • Data accessibility: Data accessibility refers to how people from a group, organization, or project can see and use data. The information needs to be readily available for use, but only to the people involved in the process; for example, access to sensitive data should be restricted to specific people or programs. In the diagram we built, we applied this pillar to our data sources, since we need to understand and retrieve data. For the same reason, it can be applied to temporary storage as well.
  • Data catalog: Cataloging and documenting data are essential for business and engineering teams. When we know what types of information reside in our databases or data lakes and have quick access to that documentation, the time needed to solve a problem becomes shorter.

Again, documenting our data sources makes the ingestion process quicker, since without documentation we would need to run a discovery process every time we ingest data.

  • Data quality: Quality is a constant concern when ingesting, processing, and loading data. It is essential to track and monitor the expected volume of incoming and outgoing data based on its periodicity. For example, if we expect to ingest 300 GB of data per day and the volume suddenly drops to 1 GB, something is very wrong and will affect the quality of our final output (see the sketch after this list). Other quality parameters can be the number of columns, partitioning, and so on, which we will explore later in this book.
  • Ownership: Who is responsible for the data? This definition is crucial for contacting the owner when problems arise and for attributing responsibility for keeping and maintaining the data.
  • Security: Data security is a pressing topic nowadays. With so many regulations about data privacy, it has become an obligation for data engineers and scientists to know at least the basics of encryption, sensitive data handling, and how to avoid data leaks. Even the languages and libraries used for work need to be evaluated. That’s why this item is attributed to the three steps in Figure 1.19.
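
To ground the quality point with the 300 GB example above, here is a minimal sketch of a daily volume check; the expected size, tolerance, and alerting mechanism are assumptions for illustration, and in practice this check would usually feed a monitoring or orchestration tool:

    # Assumed expectations for the example: ~300 GB per day, alert below 10% of that.
    EXPECTED_DAILY_BYTES = 300 * 1024**3
    MIN_RATIO = 0.1

    def volume_looks_healthy(ingested_bytes: int) -> bool:
        """Return False when today's ingested volume is suspiciously small."""
        return ingested_bytes >= EXPECTED_DAILY_BYTES * MIN_RATIO

    if __name__ == "__main__":
        ingested_today = 1 * 1024**3  # only ~1 GB arrived today
        if not volume_looks_healthy(ingested_today):
            print("Quality alert: ingested volume is far below the expected 300 GB")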

In addition to the topics we explored, a broader data governance project includes a vital role called the data steward, who is responsible for managing an organization’s data assets and ensuring that data is accurate, consistent, and secure. In short, data stewardship is the practice of managing and overseeing an organization’s data assets.

See also

You can read more about a recent vulnerability found in one of the most used tools for data engineering here: https://www.ncsc.gov.uk/information/log4j-vulnerability-what-everyone-needs-to-know.