Data Ingestion with Python Cookbook

By: Gláucia Esppenchutz

Overview of this book

Data Ingestion with Python Cookbook offers a practical approach to designing and implementing data ingestion pipelines. It presents real-world examples with the most widely recognized open-source tools on the market to answer commonly asked questions and overcome challenges. You’ll be introduced to designing and working with or without data schemas, as well as creating monitored pipelines with Airflow and data observability principles, all while following industry best practices. The book also addresses challenges associated with reading different data sources and data formats. As you progress through the book, you’ll gain a broader understanding of error logging best practices, troubleshooting techniques, data orchestration, monitoring, and storing logs for further consultation. By the end of the book, you’ll have a fully automated setup that enables you to start ingesting and monitoring your data pipeline effortlessly, facilitating seamless integration with subsequent stages of the ETL process.

Creating schemas

Schemas are considered blueprints of a database or table. While some databases strictly require schema definition, others can work without it. However, in some cases, it is advantageous to work with data schemas to ensure that the application data architecture is maintained and can receive the desired data input.
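To make the idea of a schema as a blueprint concrete, here is a minimal sketch in plain Python that checks whether an incoming record matches an expected set of fields and types. The field names (`student_id`, `name`, `email`) and the helper `conforms` are hypothetical and are not taken from the recipe; they only illustrate how a schema can guard the desired data input.

```python
# Hypothetical schema for a student record: field name -> expected Python type.
STUDENT_SCHEMA = {"student_id": int, "name": str, "email": str}


def conforms(record: dict, schema: dict) -> bool:
    """Return True if the record has exactly the schema's fields with matching types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[field], expected) for field, expected in schema.items())


good = {"student_id": 1, "name": "Ana", "email": "ana@example.com"}
bad = {"student_id": "1", "name": "Ana"}  # wrong type and a missing field

print(conforms(good, STUDENT_SCHEMA))
print(conforms(bad, STUDENT_SCHEMA))
```

A database that requires a schema performs a similar check on every write, while a schemaless store would accept both records as they are.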

Getting ready

Let’s imagine we need to create a database for a school to store information about the students, the courses, and the instructors. With this information, we know we have at least three tables so far.

Figure 1.13 – A table diagram for three entities

In this recipe, we will cover how schemas work, using an Entity Relationship Diagram (ERD), a visual representation of the relationships between entities in a database, to show how the tables in a schema are connected.

How to do it…

Here are the steps to try this:

  1. We define the type of schema. The following figure helps us understand how to go about this:
Figure 1.14 – A diagram to help you decide which schema to use

  2. Then, we define the fields and the data type for each table column:
Figure 1.15 – A definition of the columns of each table

  3. Next, we define which fields can be empty or NULL:
Figure 1.16 – A definition of which columns can be NULL

  4. Then, we create the relationship between the tables:
Figure 1.17 – A relationship diagram of the tables

How it works…

When designing data schemas, the first thing we need to do is define their type. As we can see in the diagram in step 1, the choice of schema architecture depends on the data’s purpose.

After that, the tables are designed. The choice of data types can vary depending on the project or purpose, but deciding what values a column can receive is important. For instance, the officeRoom column in the Teacher table can be an Integer type if we know the room’s identification is always numeric, or a String type if we are unsure how rooms are identified (for example, Room 3-D).
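The trade-off between the two types can be sketched in a few lines of Python. The helper `parse_office_room` is hypothetical and only mimics what a typed column would do with a raw value; it is not part of the recipe.

```python
def parse_office_room(value: str, as_integer: bool):
    """Convert a raw office-room value to the type chosen for the column."""
    return int(value) if as_integer else str(value)


# A purely numeric identifier fits an Integer column.
print(parse_office_room("301", as_integer=True))

# A String column accepts any identifier format.
print(parse_office_room("Room 3-D", as_integer=False))

# The same identifier cannot be stored in an Integer column.
try:
    parse_office_room("Room 3-D", as_integer=True)
except ValueError as error:
    print(f"Rejected by an Integer column: {error}")
```

Choosing Integer gives stricter validation, while choosing String keeps the column flexible if the naming convention changes later.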

Another important topic, covered in step 3, is defining which columns can accept NULL values. Can a field for a student’s name be empty? If not, we need to create a constraint that forbids this type of insert.
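As an illustration, the following sketch uses Python’s built-in `sqlite3` module with an in-memory database. The table and column names (`student`, `name`, `phone`) are assumptions made for this example; the actual column definitions are the ones shown in the figures. The `NOT NULL` constraint on `name` causes the second insert to be rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed columns: name is mandatory, phone is optional.
conn.execute(
    "CREATE TABLE student ("
    " student_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " phone TEXT"
    ")"
)

# A NULL phone is allowed because the column has no NOT NULL constraint.
conn.execute("INSERT INTO student (name, phone) VALUES (?, ?)", ("Ana", None))

# A NULL name violates the constraint and raises IntegrityError.
try:
    conn.execute("INSERT INTO student (name, phone) VALUES (?, ?)", (None, "555-0100"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
print(rejected, row_count)
```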

Finally, based on the type of schema, we define the relationships between the tables.
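One common way to express such a relationship in a relational schema is a foreign key. The sketch below again uses `sqlite3` with an in-memory database; the tables `teacher` and `course` and their columns are hypothetical stand-ins for the entities in the diagram, and the one-teacher-per-course relationship is an assumption made only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE teacher (
    teacher_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE course (
    course_id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    teacher_id INTEGER NOT NULL REFERENCES teacher (teacher_id)
);
""")

conn.execute("INSERT INTO teacher (teacher_id, name) VALUES (1, 'Ms. Lee')")
conn.execute("INSERT INTO course (title, teacher_id) VALUES ('Math', 1)")

# A course that points to a non-existent teacher is rejected.
try:
    conn.execute("INSERT INTO course (title, teacher_id) VALUES ('Art', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# The relationship lets us join the two tables.
courses = conn.execute(
    "SELECT c.title, t.name FROM course AS c"
    " JOIN teacher AS t ON t.teacher_id = c.teacher_id"
).fetchall()
print(orphan_rejected, courses)
```

A many-to-many relationship, such as students enrolled in courses, would typically need an additional linking table with two foreign keys.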

See also

If you want to know more about database schema designs and their application, read this article by Mark Smallcombe: https://www.integrate.io/blog/database-schema-examples/.