Data Ingestion with Python Cookbook

By: Gláucia Esppenchutz

Overview of this book

Data Ingestion with Python Cookbook offers a practical approach to designing and implementing data ingestion pipelines. It presents real-world examples with the most widely recognized open source tools on the market to answer commonly asked questions and overcome challenges. You’ll be introduced to designing pipelines that work with or without data schemas, as well as creating monitored pipelines with Airflow and data observability principles, all while following industry best practices. The book also addresses challenges associated with reading different data sources and data formats. As you progress through the book, you’ll gain a broader understanding of error logging best practices, troubleshooting techniques, data orchestration, monitoring, and storing logs for further consultation. By the end of the book, you’ll have a fully automated setup that enables you to start ingesting and monitoring your data pipelines effortlessly, facilitating seamless integration with the subsequent stages of the ETL process.
Table of Contents (17 chapters)

Part 1: Fundamentals of Data Ingestion
Part 2: Structuring the Ingestion Pipeline

Configuring Airflow

Apache Airflow offers many capabilities and a quick setup, which helps us start designing our workflows as code right away. As we progress with our workflows and into data processing, some additional configuration might be required. Fortunately, Airflow has a dedicated file for inserting extra settings without changing anything in its core.

In this recipe, we will learn more about the airflow.cfg file, how to use it, and other valuable configurations required to execute the other recipes in this chapter. We will also cover where to find this file and how environment variables work with this tool. Understanding these concepts in practice helps us identify potential improvements and solve problems.
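As a minimal sketch of what this looks like (the exact sections, keys, and defaults vary by Airflow version, and the values below are illustrative), airflow.cfg is an INI-style file, and every entry in it can also be overridden with an environment variable named after the AIRFLOW__{SECTION}__{KEY} pattern:

    # Excerpt of an airflow.cfg file (INI format); values are illustrative
    [core]
    dags_folder = /opt/airflow/dags
    load_examples = False

    [webserver]
    base_url = http://localhost:8080

    # Any key can be overridden via an environment variable following
    # the AIRFLOW__{SECTION}__{KEY} convention, for example:
    export AIRFLOW__CORE__LOAD_EXAMPLES=False

Environment variables take precedence over the values in airflow.cfg, which makes them convenient for container-based deployments where you don't want to rebuild an image just to change a setting.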

Getting ready

Before moving on to the code, ensure your Airflow instance is running correctly. You can do that by checking the Airflow UI at http://localhost:8080.
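If you prefer the command line, the Airflow webserver also exposes a health endpoint. As a quick sketch, assuming the default port mapping:

    # Query the webserver's health endpoint (default port 8080)
    curl http://localhost:8080/health
    # A healthy instance returns JSON reporting the metadatabase
    # and scheduler statuses as "healthy"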

If you are using a Docker container (as I am) to host your Airflow application, you can check its status...
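As a sketch, assuming your services were started with Docker Compose and the container names include "airflow", the status check could look like this:

    # List running containers whose names match "airflow"
    docker ps --filter "name=airflow"
    # Or, from the directory containing your docker-compose.yaml:
    docker compose ps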