Serverless ETL and Analytics with AWS Glue

By: Vishal Pathak, Subramanya Vajiraya, Noritaka Sekiyama, Tomohiro Tanaka, Albert Quiroga, Ishan Gaur

Overview of this book

Organizations have gravitated toward services such as AWS Glue that take on undifferentiated heavy lifting and provide serverless Spark, enabling you to create and manage data lakes in a serverless fashion. This guide shows you how AWS Glue can be used to solve real-world problems while helping you learn about data processing, data integration, and building data lakes. Beginning with AWS Glue basics, this book teaches you how to perform various aspects of data analysis, such as ad hoc queries, data visualization, and real-time analysis, using this service. It also provides a walk-through of CI/CD for AWS Glue and shows how to shift left on quality using automated regression tests. You’ll find out how data security aspects such as access control, encryption, auditing, and networking are implemented, and get to grips with useful techniques such as picking the right file format, compression, partitioning, and bucketing. As you advance, you’ll discover AWS Glue features such as crawlers, Lake Formation, governed tables, lineage, DataBrew, Glue Studio, and custom connectors. The concluding chapters help you understand various performance tuning, troubleshooting, and monitoring options. By the end of this AWS book, you’ll be able to create, manage, troubleshoot, and deploy ETL pipelines using AWS Glue.
Table of Contents (20 chapters)
Section 1 – Introduction, Concepts, and the Basics of AWS Glue
Section 2 – Data Preparation, Management, and Security
Section 3 – Tuning, Monitoring, Data Lake Common Scenarios, and Interesting Edge Cases

Troubleshooting and debugging common issues in AWS Glue ETL

AWS Glue makes it easy to implement data integration workloads using its different components and microservices, but depending on the configuration and use case, we may still run into a number of issues. In this section, we will discuss some common issues encountered while working with AWS Glue and the methods for solving each of them.

ETL job failures

A Glue ETL job can fail for a number of reasons. Most job failures can be attributed to issues with configuration or resource provisioning, depending on the use case. Let’s explore some common issues we may come across while working with Glue ETL.
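
When a job run fails, a reasonable first step is to look at the state and error message that Glue recorded for that run. The following is a minimal sketch, not a definitive implementation: it assumes boto3 is installed and AWS credentials are configured for the fetch_job_run helper, and the names fetch_job_run, summarize_failure, and FAILED_STATES are hypothetical helpers introduced here for illustration. The summary function only reads a dictionary shaped like the JobRun structure returned by the Glue GetJobRun API, so it can be tried without an AWS account.

```python
from typing import Optional


def fetch_job_run(job_name: str, run_id: str) -> dict:
    """Fetch one job run from the Glue API (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so summarize_failure works without boto3

    glue = boto3.client("glue")
    return glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]


# Job run states treated here as "ended unsuccessfully" (an assumption of
# this sketch; adjust to match how your team defines a failure).
FAILED_STATES = {"FAILED", "TIMEOUT", "ERROR"}


def summarize_failure(job_run: dict) -> Optional[str]:
    """Return a one-line failure summary, or None if the run did not fail."""
    state = job_run.get("JobRunState")
    if state not in FAILED_STATES:
        return None
    message = job_run.get("ErrorMessage", "no error message reported")
    name = job_run.get("JobName", "?")
    run_id = job_run.get("Id", "?")
    return f"{name} run {run_id} {state}: {message}"


# Illustrative input only: the job name, run ID, and message are made up.
sample_run = {
    "JobName": "example-job",
    "Id": "jr_example",
    "JobRunState": "FAILED",
    "ErrorMessage": "placeholder error text",
}
summary = summarize_failure(sample_run)
print(summary)
```

In practice, the dictionary would come from fetch_job_run (or from listing recent runs), and the summary line gives a starting point for deciding whether the failure looks like a configuration problem or a resource problem.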

OOM errors

When working with a large volume of data, it is not uncommon to run into out-of-memory (OOM) errors. OOM errors can occur in both the driver and the executors, depending on the use case. How we approach the issue largely depends on where exactly it is occurring, whether in the driver or the...