Data Engineering with Python

By: Paul Crickard

Overview of this book

Data engineering provides the foundation for data science and analytics, and forms an important part of all businesses. This book will help you to explore various tools and methods that are used for understanding the data engineering process using Python. The book will show you how to tackle challenges commonly faced in different aspects of data engineering. You’ll start with an introduction to the basics of data engineering, along with the technologies and frameworks required to build data pipelines to work with large datasets. You’ll learn how to transform and clean data and perform analytics to get the most out of your data. As you advance, you'll discover how to work with big data of varying complexity and production databases, and build data pipelines. Using real-world examples, you’ll build architectures on which you’ll learn how to deploy data pipelines. By the end of this Python book, you’ll have gained a clear understanding of data modeling techniques, and will be able to confidently build data engineering pipelines for tracking data, running quality checks, and making necessary changes in production.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Section 1: Building Data Pipelines – Extract, Transform, and Load

Chapter 1: What is Data Engineering?

What data engineers do

Data engineering versus data science

Data engineering tools

Chapter 2: Building Our Data Engineering Infrastructure

Installing and configuring Apache NiFi

Installing and configuring Apache Airflow

Installing and configuring Elasticsearch

Installing and configuring Kibana

Installing and configuring PostgreSQL

Installing pgAdmin 4

Chapter 3: Reading and Writing Files

Writing and reading files in Python

Building data pipelines in Apache Airflow

Handling files using NiFi processors

Chapter 4: Working with Databases

Inserting and extracting relational data in Python

Inserting and extracting NoSQL database data in Python

Building data pipelines in Apache Airflow

Handling databases with NiFi processors

Chapter 5: Cleaning, Transforming, and Enriching Data

Performing exploratory data analysis in Python

Handling common data issues using pandas

Cleaning data using Airflow

Chapter 6: Building a 311 Data Pipeline

Building the data pipeline

Building a Kibana dashboard

Section 2: Deploying Data Pipelines in Production

Chapter 7: Features of a Production Pipeline

Staging and validating data

Building idempotent data pipelines

Building atomic data pipelines

Chapter 8: Version Control with the NiFi Registry

Installing and configuring the NiFi Registry

Using the Registry in NiFi

Versioning your data pipelines

Using git-persistence with the NiFi Registry

Chapter 9: Monitoring Data Pipelines

Monitoring NiFi using the GUI

Monitoring NiFi with processors

Using Python with the NiFi REST API

Chapter 10: Deploying Data Pipelines

Finalizing your data pipelines for production

Using the NiFi variable registry

Deploying your data pipelines

Chapter 11: Building a Production Data Pipeline

Creating a test and production environment

Building a production data pipeline

Deploying a data pipeline in production

Section 3: Beyond Batch – Building Real-Time Data Pipelines

Chapter 12: Building a Kafka Cluster

Creating ZooKeeper and Kafka clusters

Testing the Kafka cluster

Chapter 13: Streaming Data with Apache Kafka

Understanding logs

Understanding how Kafka uses logs

Building data pipelines with Kafka and NiFi

Differentiating stream processing from batch processing

Producing and consuming with Python

Chapter 14: Data Processing with Apache Spark

Installing and running Spark

Installing and configuring PySpark

Processing data with PySpark

Chapter 15: Real-Time Edge Data with MiNiFi, Kafka, and Spark

Setting up MiNiFi

Building a MiNiFi task in NiFi

Other Books You May Enjoy

Leave a review - let other readers know what you think

Appendix

Building a NiFi cluster

The basics of NiFi clustering

Building a NiFi cluster

Building a distributed data pipeline

Managing the distributed data pipeline

Building a distributed data pipeline

Building a distributed data pipeline is almost exactly the same as building a data pipeline to run on a single machine. NiFi will handle the logistics of passing and recombining the data. A basic data pipeline is shown in the following screenshot:

Figure 16.4 – A basic data pipeline to generate data, extract attributes to JSON, and write to disk

The preceding data pipeline uses the GenerateFlowFile processor to create unique flowfiles. Each flowfile is passed downstream to the AttributesToJSON processor, which extracts the attributes and writes them to the flowfile content. Lastly, the PutFile processor writes the file to disk at /home/paulcrickard/output.
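To make the three steps concrete, here is a minimal plain-Python sketch of the same flow. It does not use NiFi or its API. The functions `generate_flowfile`, `attributes_to_json`, `put_file`, and `run_pipeline` are hypothetical stand-ins named after the processors, and the dict-based flowfile and the `uuid` and `filename` attributes are simplifying assumptions rather than NiFi's actual data model.

```python
import json
import tempfile
import uuid
from pathlib import Path


def generate_flowfile():
    """Stand-in for GenerateFlowFile: a unique flowfile with empty content."""
    flowfile_id = str(uuid.uuid4())
    return {
        # Assumed attribute names, chosen for illustration only.
        "attributes": {"uuid": flowfile_id, "filename": flowfile_id},
        "content": "",
    }


def attributes_to_json(flowfile):
    """Stand-in for AttributesToJSON: serialize the attributes into the content."""
    return {
        "attributes": dict(flowfile["attributes"]),
        "content": json.dumps(flowfile["attributes"], sort_keys=True),
    }


def put_file(flowfile, output_dir):
    """Stand-in for PutFile: write the content to a file in output_dir.

    The directory must already exist; otherwise write_text raises
    FileNotFoundError.
    """
    target = Path(output_dir) / flowfile["attributes"]["filename"]
    target.write_text(flowfile["content"], encoding="utf-8")
    return target


def run_pipeline(output_dir, count=3):
    """Run generate -> attributes to JSON -> write for `count` flowfiles."""
    return [
        put_file(attributes_to_json(generate_flowfile()), output_dir)
        for _ in range(count)
    ]


if __name__ == "__main__":
    # A temporary directory stands in for the real output directory.
    with tempfile.TemporaryDirectory() as output_dir:
        for path in run_pipeline(output_dir):
            print(path.name, path.read_text(encoding="utf-8"))
```

In the sketch, everything runs in one process on one machine. In a NiFi cluster, the same three steps run on every node, so each node writes its own files to its own local disk.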

Before running the data pipeline, make sure that the output directory for the PutFile processor exists on each node. Earlier, I said that data pipelines are no different when distributed, but there are some things you must keep in mind, one being that PutFile will write...