Book Image

Data Ingestion with Python Cookbook

By : Gláucia Esppenchutz
Book Image

Data Ingestion with Python Cookbook

By: Gláucia Esppenchutz

Overview of this book

Data Ingestion with Python Cookbook offers a practical approach to designing and implementing data ingestion pipelines. It presents real-world examples with the most widely recognized open source tools on the market to answer commonly asked questions and overcome challenges. You’ll be introduced to designing and working with or without data schemas, as well as creating monitored pipelines with Airflow and data observability principles, all while following industry best practices. The book also addresses challenges associated with reading different data sources and data formats. As you progress through the book, you’ll gain a broader understanding of error logging best practices, troubleshooting techniques, data orchestration, monitoring, and storing logs for further consultation. By the end of the book, you’ll have a fully automated set that enables you to start ingesting and monitoring your data pipeline effortlessly, facilitating seamless integration with subsequent stages of the ETL process.
Table of Contents (17 chapters)
1
Part 1: Fundamentals of Data Ingestion
9
Part 2: Structuring the Ingestion Pipeline

Importing unstructured data without a schema

As seen before, unstructured data or NoSQL is a group of information that does not follow a format, such as relational or tabular data. It can be presented as an image, video, metadata, transcripts, and so on. The data ingestion process usually involves a JSON file or a document collection, as we previously saw when ingesting data from MongoDB.

In this recipe, we will read a JSON file and transform it into a DataFrame without a schema. Although unstructured data is supposed to have a more flexible design, we will see some implications of not having any schema or structure in our DataFrame.

Getting ready…

Here, we will use the holiday_brazil.json file to create the DataFrame. You can find it in the GitHub repository here: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook.

We will use SparkSession to read the JSON file and create a DataFrame to ensure the session is up and running.

All code can be...