Data Engineering with Python

By: Paul Crickard

Overview of this book

Data engineering provides the foundation for data science and analytics, and forms an important part of all businesses. This book will help you to explore various tools and methods that are used for understanding the data engineering process using Python. The book will show you how to tackle challenges commonly faced in different aspects of data engineering. You’ll start with an introduction to the basics of data engineering, along with the technologies and frameworks required to build data pipelines to work with large datasets. You’ll learn how to transform and clean data and perform analytics to get the most out of your data. As you advance, you'll discover how to work with big data of varying complexity and production databases, and build data pipelines. Using real-world examples, you’ll build architectures on which you’ll learn how to deploy data pipelines. By the end of this Python book, you’ll have gained a clear understanding of data modeling techniques, and will be able to confidently build data engineering pipelines for tracking data, running quality checks, and making necessary changes in production.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Section 1: Building Data Pipelines – Extract, Transform, and Load

Chapter 1: What is Data Engineering?

What data engineers do

Data engineering versus data science

Data engineering tools

Chapter 2: Building Our Data Engineering Infrastructure

Installing and configuring Apache NiFi

Installing and configuring Apache Airflow

Installing and configuring Elasticsearch

Installing and configuring Kibana

Installing and configuring PostgreSQL

Installing pgAdmin 4

Chapter 3: Reading and Writing Files

Writing and reading files in Python

Building data pipelines in Apache Airflow

Handling files using NiFi processors

Chapter 4: Working with Databases

Inserting and extracting relational data in Python

Inserting and extracting NoSQL database data in Python

Building data pipelines in Apache Airflow

Handling databases with NiFi processors

Chapter 5: Cleaning, Transforming, and Enriching Data

Performing exploratory data analysis in Python

Handling common data issues using pandas

Cleaning data using Airflow

Chapter 6: Building a 311 Data Pipeline

Building the data pipeline

Building a Kibana dashboard

Section 2: Deploying Data Pipelines in Production

Chapter 7: Features of a Production Pipeline

Staging and validating data

Building idempotent data pipelines

Building atomic data pipelines

Chapter 8: Version Control with the NiFi Registry

Installing and configuring the NiFi Registry

Using the Registry in NiFi

Versioning your data pipelines

Using git-persistence with the NiFi Registry

Chapter 9: Monitoring Data Pipelines

Monitoring NiFi using the GUI

Monitoring NiFi with processors

Using Python with the NiFi REST API

Chapter 10: Deploying Data Pipelines

Finalizing your data pipelines for production

Using the NiFi variable registry

Deploying your data pipelines

Chapter 11: Building a Production Data Pipeline

Creating a test and production environment

Building a production data pipeline

Deploying a data pipeline in production

Section 3: Beyond Batch – Building Real-Time Data Pipelines

Chapter 12: Building a Kafka Cluster

Creating ZooKeeper and Kafka clusters

Testing the Kafka cluster

Chapter 13: Streaming Data with Apache Kafka

Understanding logs

Understanding how Kafka uses logs

Building data pipelines with Kafka and NiFi

Differentiating stream processing from batch processing

Producing and consuming with Python

Chapter 14: Data Processing with Apache Spark

Installing and running Spark

Installing and configuring PySpark

Processing data with PySpark

Chapter 15: Real-Time Edge Data with MiNiFi, Kafka, and Spark

Setting up MiNiFi

Building a MiNiFi task in NiFi

Other Books You May Enjoy

Leave a review - let other readers know what you think

Appendix

Building a NiFi cluster

The basics of NiFi clustering

Building a NiFi cluster

Building a distributed data pipeline

Managing the distributed data pipeline

Installing and running Spark

Apache Spark is a distributed data processing engine that can handle both streaming and batch data, and even graphs. It has a core set of components, with additional libraries layered on top to add functionality. A common depiction of the Spark ecosystem is shown in the following diagram:

Figure 14.1 – The Apache Spark ecosystem

To run Spark as a cluster, you have several options. Spark can run in standalone mode, which uses a simple cluster manager provided by Spark itself. It can also run on Amazon EC2 instances, or under an external cluster manager such as YARN, Mesos, or Kubernetes. In a production environment with a significant workload, you would probably not want to run in standalone mode; however, this is how we will stand up our cluster in this chapter. The principles are the same whichever cluster manager you choose, and the standalone cluster is the fastest way to get up and running without needing to dive into more complicated infrastructure.
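From the application's point of view, the choice of cluster manager mostly shows up in the master URL that the application connects to. The following is a minimal sketch of that idea, not code from this chapter: the helper name master_url and the host name localhost are hypothetical, and the default ports (7077 for standalone, 5050 for Mesos, 443 for the Kubernetes API server) are assumptions that your own installation may override.

```python
def master_url(manager, host=None, port=None):
    """Build a Spark master URL for the given cluster manager.

    Hypothetical helper for illustration only. The default ports
    below are assumptions, not values taken from this chapter.
    """
    if manager == "local":
        # Run Spark in-process, using all available cores
        return "local[*]"
    if manager == "yarn":
        # YARN locates the cluster from the Hadoop configuration
        return "yarn"
    if host is None:
        raise ValueError(f"{manager} requires a host")
    if manager == "standalone":
        return f"spark://{host}:{port or 7077}"
    if manager == "mesos":
        return f"mesos://{host}:{port or 5050}"
    if manager == "kubernetes":
        # The Kubernetes master URL always carries an explicit port
        return f"k8s://https://{host}:{port or 443}"
    raise ValueError(f"unknown cluster manager: {manager}")


# With PySpark installed and a standalone master running, the URL
# would be handed to the session builder along these lines:
#
#   from pyspark.sql import SparkSession
#   spark = (SparkSession.builder
#            .master(master_url("standalone", "localhost"))
#            .appName("example")
#            .getOrCreate())
print(master_url("standalone", "localhost"))
```

Only the URL changes between cluster managers; the application code that follows the session creation stays the same, which is why a standalone cluster is a reasonable place to learn.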

To install Apache Spark, take the following...