Data Engineering with AWS - Second Edition

By: Gareth Eagar

Overview of this book

This book, authored by a seasoned Senior Data Architect with 25 years of experience, aims to help you achieve proficiency in using the AWS ecosystem for data engineering. This revised edition provides updates in every chapter to cover the latest AWS services and features, takes a refreshed look at data governance, and includes a brand-new section on building modern data platforms, which covers implementing a data mesh approach, open table formats (such as Apache Iceberg), and using DataOps for automation and observability. You'll begin by reviewing the key concepts and essential AWS tools in a data engineer's toolkit and getting acquainted with modern data management approaches. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how that transformed data is used by various data consumers. You'll learn how to ensure strong data governance, how to populate data marts and data warehouses, and how a data lakehouse fits into the picture. After that, you'll be introduced to AWS tools for analyzing data, including those for ad hoc SQL queries and creating visualizations. Then, you'll explore how machine learning and artificial intelligence can be used to draw new insights from data. In the final chapters, you'll discover transactional data lakes, data meshes, and how to build a modern data platform on AWS. By the end of this book, you'll be able to execute data engineering tasks and implement a data pipeline on AWS.

Preface

Who this book is for

What this book covers

To get the most out of this book

Section 1: AWS Data Engineering Concepts and Trends

An Introduction to Data Engineering

Technical requirements

The rise of big data as a corporate asset

The challenges of ever-growing datasets

The role of the data engineer as a big data enabler

The benefits of the cloud when building big data analytic solutions

Hands-on – creating and accessing your AWS account

Data Management Architectures for Analytics

Technical requirements

The evolution of data management for analytics

A deeper dive into data warehouse concepts and architecture

An overview of data lake architecture and concepts

Bringing together the best of data warehouses and data lakes

Hands-on – using the AWS Command Line Interface (CLI) to create Simple Storage Service (S3) buckets

The AWS Data Engineer’s Toolkit

Technical requirements

An overview of AWS services for ingesting data

An overview of AWS services for transforming data

An overview of AWS services for orchestrating big data pipelines

An overview of AWS services for consuming data

Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket

Data Governance, Security, and Cataloging

Technical requirements

The many different aspects of data governance

Data security, access, and privacy

Data quality, data profiling, and data lineage

Business and technical data catalogs

AWS services that help with data governance

Hands-on – configuring Lake Formation permissions

Section 2: Architecting and Implementing Data Engineering Pipelines and Transformations

Architecting Data Engineering Pipelines

Technical requirements

Approaching the data pipeline architecture

Identifying data consumers and understanding their requirements

Identifying data sources and ingesting data

Identifying data transformations and optimizations

Loading data into data marts

Wrapping up the whiteboarding session

Hands-on – architecting a sample pipeline

Ingesting Batch and Streaming Data

Technical requirements

Understanding data sources

Ingesting data from a relational database

Ingesting streaming data

Hands-on – ingesting data with AWS DMS

Hands-on – ingesting streaming data

Transforming Data to Optimize for Analytics

Technical requirements

Overview of how transformations can create value

Types of data transformation tools

Common data preparation transformations

Common business use case transformations

Working with Change Data Capture (CDC) data

Hands-on – joining datasets with AWS Glue Studio

Identifying and Enabling Data Consumers

Technical requirements

Understanding the impact of data democratization

Meeting the needs of business users with data visualization

Meeting the needs of data analysts with structured reporting

Meeting the needs of data scientists and ML models

Hands-on – creating data transformations with AWS Glue DataBrew

A Deeper Dive into Data Marts and Amazon Redshift

Technical requirements

Extending analytics with data warehouses/data marts

What not to do – anti-patterns for a data warehouse

Redshift architecture review and storage deep dive

Designing a high-performance data warehouse

Moving data between a data lake and Redshift

Exploring advanced Redshift features

Hands-on – deploying a Redshift Serverless cluster and running Redshift Spectrum queries

Orchestrating the Data Pipeline

Technical requirements

Understanding the core concepts for pipeline orchestration

Examining the options for orchestrating pipelines in AWS

Hands-on – orchestrating a data pipeline using AWS Step Functions

Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning

Ad Hoc Queries with Amazon Athena

Technical requirements

An introduction to Amazon Athena

Tips and tricks to optimize Amazon Athena queries

Exploring advanced Athena functionality

Managing groups of users with Amazon Athena workgroups

Hands-on – creating an Amazon Athena workgroup and configuring Athena settings

Hands-on – switching workgroups and running queries

Visualizing Data with Amazon QuickSight

Technical requirements

Representing data visually for maximum impact

Understanding Amazon QuickSight’s core concepts

Ingesting and preparing data from a variety of sources

Creating and sharing visuals with QuickSight analyses and dashboards

Understanding QuickSight’s advanced features

Hands-on – creating a simple QuickSight visualization

Enabling Artificial Intelligence and Machine Learning

Technical requirements

Understanding the value of AI and ML for organizations

Exploring AWS services for ML

Exploring AWS services for AI

Building generative AI solutions on AWS

Common use cases for LLMs

Hands-on – reviewing reviews with Amazon Comprehend

Section 4: Modern Strategies: Open Table Formats, Data Mesh, DataOps, and Preparing for the Real World

Building Transactional Data Lakes

Technical requirements

What does it mean for a data lake to be transactional?

An overview of Delta Lake, Apache Hudi, and Apache Iceberg

AWS service integrations for building transactional data lakes

Hands-on – Working with Apache Iceberg tables in AWS

Implementing a Data Mesh Strategy

Technical requirements

What is a data mesh?

Challenges that a data mesh approach attempts to resolve

The organizational and technical challenges of building a data mesh

AWS services that help enable a data mesh approach

A sample architecture for a data mesh on AWS

Hands-on – Setting up Amazon DataZone

Building a Modern Data Platform on AWS

Technical requirements

Goals of a modern data platform

Deciding whether to build or buy a data platform

DataOps as an approach to building data platforms

Hands-on – automated deployment of data platform components and data transformation code

Wrapping Up the First Part of Your Learning Journey

Technical requirements

Understanding the complexities of real-world data environments

Examining examples of real-world data pipelines

Imagining the future – a look at emerging trends

Hands-on – cleaning up your AWS account

Other Books You May Enjoy

Index

Common data preparation transformations

The first set of transformations that we look at are those that help prepare the data for further transformations later in the pipeline. These transformations are designed to apply relatively generic optimizations to individual datasets that we are ingesting into the data lake. For these optimizations, you may need some understanding of the source data system and context, but, generally, you do not need to understand the ultimate business use case for the dataset.

Protecting PII data

Often, datasets that we ingest may contain personally identifiable information (PII) data, and there may be governance restrictions on which PII data can be stored in the data lake. As a result, we need to have a process that protects the PII data as soon as possible after it is ingested.

There are a number of common approaches that can be used here (such as tokenization or hashing), each with its own advantages and disadvantages, as we discussed in...
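
As a rough illustration of the hashing approach mentioned above, the following minimal Python sketch replaces selected fields in a record with a keyed SHA-256 digest (HMAC). It is not taken from the book: the field names, the sample record, the `SECRET_KEY` value, and the helper functions `hash_value` and `protect_record` are all hypothetical and exist only for this example.

```python
import hashlib
import hmac

# Hypothetical secret key for this sketch only. In a real pipeline the key
# would be loaded from a secrets store, never hard-coded in source.
SECRET_KEY = b"example-key-do-not-use"

# Hypothetical set of column names that are treated as PII.
PII_FIELDS = {"email", "phone"}


def hash_value(value: str, key: bytes = SECRET_KEY) -> str:
    """Return the keyed SHA-256 digest (HMAC) of a value as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def protect_record(record: dict, pii_fields=PII_FIELDS) -> dict:
    """Return a copy of the record with PII fields replaced by keyed hashes.

    Non-PII fields and missing (None) values are passed through unchanged.
    """
    return {
        name: hash_value(str(value))
        if name in pii_fields and value is not None
        else value
        for name, value in record.items()
    }


# Hypothetical sample record, as it might arrive from a source system.
record = {"customer_id": 42, "email": "jane@example.com", "phone": "555-0100"}
protected = protect_record(record)
```

Because the same input and key always produce the same digest, hashed columns can still be used to join or group records, but the original values cannot be recovered from the digest. Tokenization differs in that a separate, secured mapping from tokens back to original values is kept, which is one of the trade-offs to weigh when choosing an approach.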