Data Engineering with AWS - Second Edition

By: Gareth Eagar

Overview of this book

This book, authored by a seasoned Senior Data Architect with 25 years of experience, aims to help you achieve proficiency in using the AWS ecosystem for data engineering. This revised edition provides updates in every chapter to cover the latest AWS services and features, takes a refreshed look at data governance, and includes a brand-new section on building modern data platforms, which covers implementing a data mesh approach, open-table formats (such as Apache Iceberg), and using DataOps for automation and observability. You'll begin by reviewing the key concepts and essential AWS tools in a data engineer's toolkit and getting acquainted with modern data management approaches. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how that transformed data is used by various data consumers. You'll learn how to ensure strong data governance, how to populate data marts and data warehouses, and how a data lakehouse fits into the picture. After that, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. Then, you'll explore how machine learning and artificial intelligence can be used to draw new insights from data. In the final chapters, you'll discover transactional data lakes, data meshes, and how to build a cutting-edge data platform on AWS. By the end of this AWS book, you'll be able to execute data engineering tasks and implement a data pipeline on AWS like a pro!

Preface

Who this book is for

What this book covers

To get the most out of this book

Section 1: AWS Data Engineering Concepts and Trends

An Introduction to Data Engineering

Technical requirements

The rise of big data as a corporate asset

The challenges of ever-growing datasets

The role of the data engineer as a big data enabler

The benefits of the cloud when building big data analytic solutions

Hands-on – creating and accessing your AWS account

Data Management Architectures for Analytics

Technical requirements

The evolution of data management for analytics

A deeper dive into data warehouse concepts and architecture

An overview of data lake architecture and concepts

Bringing together the best of data warehouses and data lakes

Hands-on – using the AWS Command Line Interface (CLI) to create Simple Storage Service (S3) buckets

The AWS Data Engineer’s Toolkit

Technical requirements

An overview of AWS services for ingesting data

An overview of AWS services for transforming data

An overview of AWS services for orchestrating big data pipelines

An overview of AWS services for consuming data

Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket

Data Governance, Security, and Cataloging

Technical requirements

The many different aspects of data governance

Data security, access, and privacy

Data quality, data profiling, and data lineage

Business and technical data catalogs

AWS services that help with data governance

Hands-on – configuring Lake Formation permissions

Section 2: Architecting and Implementing Data Engineering Pipelines and Transformations

Architecting Data Engineering Pipelines

Technical requirements

Approaching the data pipeline architecture

Identifying data consumers and understanding their requirements

Identifying data sources and ingesting data

Identifying data transformations and optimizations

Loading data into data marts

Wrapping up the whiteboarding session

Hands-on – architecting a sample pipeline

Ingesting Batch and Streaming Data

Technical requirements

Understanding data sources

Ingesting data from a relational database

Ingesting streaming data

Hands-on – ingesting data with AWS DMS

Hands-on – ingesting streaming data

Transforming Data to Optimize for Analytics

Technical requirements

Overview of how transformations can create value

Types of data transformation tools

Common data preparation transformations

Common business use case transformations

Working with Change Data Capture (CDC) data

Hands-on – joining datasets with AWS Glue Studio

Identifying and Enabling Data Consumers

Technical requirements

Understanding the impact of data democratization

Meeting the needs of business users with data visualization

Meeting the needs of data analysts with structured reporting

Meeting the needs of data scientists and ML models

Hands-on – creating data transformations with AWS Glue DataBrew

A Deeper Dive into Data Marts and Amazon Redshift

Technical requirements

Extending analytics with data warehouses/data marts

What not to do – anti-patterns for a data warehouse

Redshift architecture review and storage deep dive

Designing a high-performance data warehouse

Moving data between a data lake and Redshift

Exploring advanced Redshift features

Hands-on – deploying a Redshift Serverless cluster and running Redshift Spectrum queries

Orchestrating the Data Pipeline

Technical requirements

Understanding the core concepts for pipeline orchestration

Examining the options for orchestrating pipelines in AWS

Hands-on – orchestrating a data pipeline using AWS Step Functions

Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning

Ad Hoc Queries with Amazon Athena

Technical requirements

An introduction to Amazon Athena

Tips and tricks to optimize Amazon Athena queries

Exploring advanced Athena functionality

Managing groups of users with Amazon Athena workgroups

Hands-on – creating an Amazon Athena workgroup and configuring Athena settings

Hands-on – switching workgroups and running queries

Visualizing Data with Amazon QuickSight

Technical requirements

Representing data visually for maximum impact

Understanding Amazon QuickSight’s core concepts

Ingesting and preparing data from a variety of sources

Creating and sharing visuals with QuickSight analyses and dashboards

Understanding QuickSight’s advanced features

Hands-on – creating a simple QuickSight visualization

Enabling Artificial Intelligence and Machine Learning

Technical requirements

Understanding the value of AI and ML for organizations

Exploring AWS services for ML

Exploring AWS services for AI

Building generative AI solutions on AWS

Common use cases for LLMs

Hands-on – reviewing reviews with Amazon Comprehend

Section 4: Modern Strategies: Open Table Formats, Data Mesh, DataOps, and Preparing for the Real World

Building Transactional Data Lakes

Technical requirements

What does it mean for a data lake to be transactional?

An overview of Delta Lake, Apache Hudi, and Apache Iceberg

AWS service integrations for building transactional data lakes

Hands-on – Working with Apache Iceberg tables in AWS

Implementing a Data Mesh Strategy

Technical requirements

What is a data mesh?

Challenges that a data mesh approach attempts to resolve

The organizational and technical challenges of building a data mesh

AWS services that help enable a data mesh approach

A sample architecture for a data mesh on AWS

Hands-on – Setting up Amazon DataZone

Building a Modern Data Platform on AWS

Technical requirements

Goals of a modern data platform

Deciding whether to build or buy a data platform

DataOps as an approach to building data platforms

Hands-on – automated deployment of data platform components and data transformation code

Wrapping Up the First Part of Your Learning Journey

Technical requirements

Understanding the complexities of real-world data environments

Examining examples of real-world data pipelines

Imagining the future – a look at emerging trends

Hands-on – cleaning up your AWS account

Other Books You May Enjoy

Index

To get the most out of this book

A basic knowledge of computer systems and concepts, and of how these are used within large organizations, is a helpful prerequisite for this book. However, no data engineering-specific skills or knowledge are required. Familiarity with cloud computing fundamentals and core AWS systems will also make it easier to follow along, especially with the hands-on exercises, but detailed step-by-step instructions are included for each task.

Note:

If you are using the digital version of this book, we advise you to access the code from the book’s GitHub repository (a link is available in the next section), rather than copying and pasting from the PDF or electronic version. Doing so will help you avoid any potential formatting errors when copying and pasting code.

Download the example code files

The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781804614426.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: “Include a WHERE Year = 2020 clause.”

A block of code is set as follows:

datalake_bucket/year=2023/file1.parquet
datalake_bucket/year=2022/file1.parquet
datalake_bucket/year=2021/file1.parquet
datalake_bucket/year=2020/file1.parquet

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

datalake_bucket/year=2023/file1.parquet
datalake_bucket/year=2022/file1.parquet
datalake_bucket/year=2021/file1.parquet
datalake_bucket/year=2020/file1.parquet

Bold: Indicates a new term, an important word, or words that you see on the screen. For instance, words in menus or dialog boxes appear in the text like this. For example: “In addition, you can use Spark SQL to process data using standard SQL.”

Warnings or important notes appear like this.

Tips and tricks appear like this.
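
The sample paths in the conventions above use a key=value folder layout (year=2023, year=2022, and so on), and the CodeInText example mentions a WHERE Year = 2020 clause. As a rough illustration only, and not code taken from the book or its GitHub repository, the Python sketch below shows how a filter on the year value could select just the matching path. The function names partition_values and filter_by_year are hypothetical and exist only for this example.

```python
# Sample object keys copied from the conventions example above.
KEYS = [
    "datalake_bucket/year=2023/file1.parquet",
    "datalake_bucket/year=2022/file1.parquet",
    "datalake_bucket/year=2021/file1.parquet",
    "datalake_bucket/year=2020/file1.parquet",
]


def partition_values(key: str) -> dict:
    """Return the key=value path segments of an object key as a dict.

    Hypothetical helper: segments without "=" (bucket name, file name)
    are ignored.
    """
    values = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep:  # only segments that actually contain "="
            values[name] = value
    return values


def filter_by_year(keys: list, year: int) -> list:
    """Keep only the keys whose year=... segment matches the given year."""
    return [k for k in keys if partition_values(k).get("year") == str(year)]


if __name__ == "__main__":
    # Roughly what a "WHERE Year = 2020" filter would need to read.
    for key in filter_by_year(KEYS, 2020):
        print(key)
```

With the four sample keys, only the year=2020 path passes the filter, so the other three folders would not need to be read at all.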