Chapter 14: Data Processing with Apache Spark | Data Engineering with Python

Data Engineering with Python

By: Paul Crickard

Overview of this book

Data engineering provides the foundation for data science and analytics, and forms an important part of all businesses. This book will help you to explore various tools and methods that are used for understanding the data engineering process using Python. The book will show you how to tackle challenges commonly faced in different aspects of data engineering. You’ll start with an introduction to the basics of data engineering, along with the technologies and frameworks required to build data pipelines to work with large datasets. You’ll learn how to transform and clean data and perform analytics to get the most out of your data. As you advance, you’ll discover how to work with big data of varying complexity and production databases, and build data pipelines. Using real-world examples, you’ll build architectures on which you’ll learn how to deploy data pipelines. By the end of this Python book, you’ll have gained a clear understanding of data modeling techniques, and will be able to confidently build data engineering pipelines for tracking data, running quality checks, and making necessary changes in production.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Section 1: Building Data Pipelines – Extract, Transform, and Load

Chapter 1: What is Data Engineering?

What data engineers do

Data engineering versus data science

Data engineering tools

Summary

Chapter 2: Building Our Data Engineering Infrastructure

Installing and configuring Apache NiFi

Installing and configuring Apache Airflow

Installing and configuring Elasticsearch

Installing and configuring Kibana

Installing and configuring PostgreSQL

Installing pgAdmin 4

Summary

Chapter 3: Reading and Writing Files

Writing and reading files in Python

Building data pipelines in Apache Airflow

Handling files using NiFi processors

Summary

Chapter 4: Working with Databases

Inserting and extracting relational data in Python

Inserting and extracting NoSQL database data in Python

Building data pipelines in Apache Airflow

Handling databases with NiFi processors

Summary

Chapter 5: Cleaning, Transforming, and Enriching Data

Performing exploratory data analysis in Python

Handling common data issues using pandas

Cleaning data using Airflow

Summary

Chapter 6: Building a 311 Data Pipeline

Building the data pipeline

Building a Kibana dashboard

Summary

Section 2: Deploying Data Pipelines in Production

Chapter 7: Features of a Production Pipeline

Staging and validating data

Building idempotent data pipelines

Building atomic data pipelines

Summary

Chapter 8: Version Control with the NiFi Registry

Installing and configuring the NiFi Registry

Using the Registry in NiFi

Versioning your data pipelines

Using git-persistence with the NiFi Registry

Summary

Chapter 9: Monitoring Data Pipelines

Monitoring NiFi using the GUI

Monitoring NiFi with processors

Using Python with the NiFi REST API

Summary

Chapter 10: Deploying Data Pipelines

Finalizing your data pipelines for production

Using the NiFi variable registry

Deploying your data pipelines

Summary

Chapter 11: Building a Production Data Pipeline

Creating a test and production environment

Building a production data pipeline

Deploying a data pipeline in production

Summary

Section 3: Beyond Batch – Building Real-Time Data Pipelines

Chapter 12: Building a Kafka Cluster

Creating ZooKeeper and Kafka clusters

Testing the Kafka cluster

Summary

Chapter 13: Streaming Data with Apache Kafka

Understanding logs

Understanding how Kafka uses logs

Building data pipelines with Kafka and NiFi

Differentiating stream processing from batch processing

Producing and consuming with Python

Summary

Chapter 14: Data Processing with Apache Spark

Installing and running Spark

Installing and configuring PySpark

Processing data with PySpark

Summary

Chapter 15: Real-Time Edge Data with MiNiFi, Kafka, and Spark

Setting up MiNiFi

Building a MiNiFi task in NiFi

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Appendix

Building a NiFi cluster

The basics of NiFi clustering

Building a NiFi cluster

Building a distributed data pipeline

Managing the distributed data pipeline

Summary
