Data Engineering with Scala and Spark

By: Eric Tome, Rupam Bhattacharjee, David Radford

Overview of this book

Most data engineers know that performance problems in a distributed computing environment can easily undermine the overall efficiency and effectiveness of data engineering tasks. While Python remains a popular choice for data engineering due to its ease of use, Scala shines in scenarios where the performance of distributed data processing is paramount. This book will teach you how to leverage the Scala programming language on the Spark framework and use the latest cloud technologies to build continuous and triggered data pipelines. You’ll do this by setting up a data engineering environment for local development and scalable distributed cloud deployments using data engineering best practices, test-driven development, and CI/CD. You’ll also get to grips with the DataFrame, Dataset, and Spark SQL APIs and how to use them. Data profiling and data quality in Scala will also be covered, alongside techniques for orchestrating and performance tuning your end-to-end pipelines to deliver data to your end users. By the end of this book, you will be able to build streaming and batch data pipelines using Scala while following software engineering best practices.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Share Your Thoughts

Download a free PDF copy of this book

Part 1 – Introduction to Data Engineering, Scala, and an Environment Setup

Chapter 1: Scala Essentials for Data Engineers

Technical requirements

Understanding functional programming

Understanding objects, classes, and traits

Working with higher-order functions (HOFs)

Understanding polymorphic functions

Variance

Option type

Understanding pattern matching

Implicits in Scala

Summary

Further reading

Part 2 – Data Ingestion, Transformation, Cleansing, and Profiling Using Scala and Spark

Chapter 3: An Introduction to Apache Spark and Its APIs – DataFrame, Dataset, and Spark SQL

Technical requirements

Working with Apache Spark

How do Spark applications work?

Creating a Spark application using Scala

Understanding the Spark Dataset API

Understanding the Spark DataFrame API

Summary

Chapter 4: Working with Databases

Technical requirements

Understanding the Spark JDBC API

Working with the Spark JDBC API

Loading the database configuration

Creating a database interface

Performing various database operations

Summary

Chapter 5: Object Stores and Data Lakes

Understanding distributed file systems

Streaming data

Working with streaming sources

Summary

Chapter 6: Understanding Data Transformation

Technical requirements

Understanding the difference between transformations and actions

Learning how to aggregate, group, and join data

Leveraging advanced window functions

Working with complex dataset types

Summary

Chapter 7: Data Profiling and Data Quality

Technical requirements

Understanding components of Deequ

Performing data analysis

Leveraging automatic constraint suggestion

Defining constraints

Storing metrics using MetricsRepository

Detecting anomalies

Summary

Part 3 – Software Engineering Best Practices for Data Engineering in Scala

Chapter 8: Test-Driven Development, Code Health, and Maintainability

Technical requirements

Introducing TDD

Running static code analysis

Understanding linting and code style

Summary

Chapter 9: CI/CD with GitHub

Technical requirements

Introducing CI/CD and GitHub

Working with GitHub

Understanding GitHub Actions

Summary

Part 4 – Productionalizing Data Engineering Pipelines – Orchestration and Tuning

Chapter 10: Data Pipeline Orchestration

Technical requirements

Understanding the basics of orchestration

Understanding core features of Apache Airflow

Working with Argo Workflows

Using Databricks Workflows

Leveraging Azure Data Factory

Summary

Chapter 11: Performance Tuning

Introducing the Spark UI

Leveraging the Spark UI for performance tuning

Right-sizing compute resources

Understanding data skewing, indexing, and partitioning

Summary

Part 5 – End-to-End Data Pipelines

Chapter 12: Building Batch Pipelines Using Spark and Scala

Understanding our business use case

Understanding the data

Understanding the medallion architecture

The end-to-end pipeline

Ingesting the data

Transforming the data

Checking data quality

Orchestrating our batch process

Summary

Chapter 13: Building Streaming Pipelines Using Spark and Scala

Understanding our business use case

What’s our IoT use case?

Ingesting the data

Transforming the data

Creating a serving layer

Orchestrating our streaming process

Summary

Index

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Download a free PDF copy of this book

Part 4 – Productionalizing Data Engineering Pipelines – Orchestration and Tuning

In this part, Chapter 10 delves into data pipeline orchestration, focusing on coordinating tasks and handling failures. It introduces tools such as Apache Airflow, Argo Workflows, Databricks Workflows, and Azure Data Factory. Chapter 11 shows how to use the Spark UI for performance optimization, covering the basics of the UI, performance tuning, right-sizing compute resources, and data skew, indexing, and partitioning.

This part has the following chapters:

Chapter 10, Data Pipeline Orchestration
Chapter 11, Performance Tuning
