Data Engineering with AWS - Second Edition

By: Gareth Eagar

Overview of this book

This book, authored by a seasoned Senior Data Architect with 25 years of experience, aims to help you achieve proficiency in using the AWS ecosystem for data engineering. This revised edition updates every chapter to cover the latest AWS services and features, takes a refreshed look at data governance, and includes a brand-new section on building modern data platforms, which covers implementing a data mesh approach, open table formats (such as Apache Iceberg), and using DataOps for automation and observability. You'll begin by reviewing the key concepts and essential AWS tools in a data engineer's toolkit and getting acquainted with modern data management approaches. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how that transformed data is used by various data consumers. You'll learn how to ensure strong data governance, how to populate data marts and data warehouses, and how a data lakehouse fits into the picture. After that, you'll be introduced to AWS tools for analyzing data, including those for ad hoc SQL queries and creating visualizations. Then, you'll explore how machine learning and artificial intelligence can be used to draw new insights from data. In the final chapters, you'll discover transactional data lakes, data meshes, and how to build a cutting-edge data platform on AWS. By the end of this AWS book, you'll be able to execute data engineering tasks and implement a data pipeline on AWS like a pro!

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

Section 1: AWS Data Engineering Concepts and Trends

An Introduction to Data Engineering

Technical requirements

The rise of big data as a corporate asset

The challenges of ever-growing datasets

The role of the data engineer as a big data enabler

The benefits of the cloud when building big data analytic solutions

Hands-on – creating and accessing your AWS account

Summary

Data Management Architectures for Analytics

Technical requirements

The evolution of data management for analytics

A deeper dive into data warehouse concepts and architecture

An overview of data lake architecture and concepts

Bringing together the best of data warehouses and data lakes

Hands-on – using the AWS Command Line Interface (CLI) to create Simple Storage Service (S3) buckets

Summary

The AWS Data Engineer’s Toolkit

Technical requirements

An overview of AWS services for ingesting data

An overview of AWS services for transforming data

An overview of AWS services for orchestrating big data pipelines

An overview of AWS services for consuming data

Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket

Summary

Data Governance, Security, and Cataloging

Technical requirements

The many different aspects of data governance

Data security, access, and privacy

Data quality, data profiling, and data lineage

Business and technical data catalogs

AWS services that help with data governance

Hands-on – configuring Lake Formation permissions

Summary

Section 2: Architecting and Implementing Data Engineering Pipelines and Transformations

Architecting Data Engineering Pipelines

Technical requirements

Approaching the data pipeline architecture

Identifying data consumers and understanding their requirements

Identifying data sources and ingesting data

Identifying data transformations and optimizations

Loading data into data marts

Wrapping up the whiteboarding session

Hands-on – architecting a sample pipeline

Summary

Ingesting Batch and Streaming Data

Technical requirements

Understanding data sources

Ingesting data from a relational database

Ingesting streaming data

Hands-on – ingesting data with AWS DMS

Hands-on – ingesting streaming data

Summary

Transforming Data to Optimize for Analytics

Technical requirements

Overview of how transformations can create value

Types of data transformation tools

Common data preparation transformations

Common business use case transformations

Working with Change Data Capture (CDC) data

Hands-on – joining datasets with AWS Glue Studio

Summary

Identifying and Enabling Data Consumers

Technical requirements

Understanding the impact of data democratization

Meeting the needs of business users with data visualization

Meeting the needs of data analysts with structured reporting

Meeting the needs of data scientists and ML models

Hands-on – creating data transformations with AWS Glue DataBrew

Summary

A Deeper Dive into Data Marts and Amazon Redshift

Technical requirements

Extending analytics with data warehouses/data marts

What not to do – anti-patterns for a data warehouse

Redshift architecture review and storage deep dive

Designing a high-performance data warehouse

Moving data between a data lake and Redshift

Exploring advanced Redshift features

Hands-on – deploying a Redshift Serverless cluster and running Redshift Spectrum queries

Summary

Orchestrating the Data Pipeline

Technical requirements

Understanding the core concepts for pipeline orchestration

Examining the options for orchestrating pipelines in AWS

Hands-on – orchestrating a data pipeline using AWS Step Functions

Summary

Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning

Ad Hoc Queries with Amazon Athena

Technical requirements

An introduction to Amazon Athena

Tips and tricks to optimize Amazon Athena queries

Exploring advanced Athena functionality

Managing groups of users with Amazon Athena workgroups

Hands-on – creating an Amazon Athena workgroup and configuring Athena settings

Hands-on – switching workgroups and running queries

Summary

Visualizing Data with Amazon QuickSight

Technical requirements

Representing data visually for maximum impact

Understanding Amazon QuickSight’s core concepts

Ingesting and preparing data from a variety of sources

Creating and sharing visuals with QuickSight analyses and dashboards

Understanding QuickSight’s advanced features

Hands-on – creating a simple QuickSight visualization

Summary

Enabling Artificial Intelligence and Machine Learning

Technical requirements

Understanding the value of AI and ML for organizations

Exploring AWS services for ML

Exploring AWS services for AI

Building generative AI solutions on AWS

Common use cases for LLMs

Hands-on – reviewing reviews with Amazon Comprehend

Summary

Section 4: Modern Strategies: Open Table Formats, Data Mesh, DataOps, and Preparing for the Real World

Building Transactional Data Lakes

Technical requirements

What does it mean for a data lake to be transactional?

An overview of Delta Lake, Apache Hudi, and Apache Iceberg

AWS service integrations for building transactional data lakes

Hands-on – Working with Apache Iceberg tables in AWS

Summary

Implementing a Data Mesh Strategy

Technical requirements

What is a data mesh?

Challenges that a data mesh approach attempts to resolve

The organizational and technical challenges of building a data mesh

AWS services that help enable a data mesh approach

A sample architecture for a data mesh on AWS

Hands-on – Setting up Amazon DataZone

Summary

Building a Modern Data Platform on AWS

Technical requirements

Goals of a modern data platform

Deciding whether to build or buy a data platform

DataOps as an approach to building data platforms

Hands-on – automated deployment of data platform components and data transformation code

Summary

Wrapping Up the First Part of Your Learning Journey

Technical requirements

Understanding the complexities of real-world data environments

Examining examples of real-world data pipelines

Imagining the future – a look at emerging trends

Hands-on – cleaning up your AWS account

Summary

Other Books You May Enjoy

Index

What this book covers

Each chapter in this book introduces important concepts or key AWS services and then provides a hands-on exercise related to the chapter's topic:

Chapter 1, An Introduction to Data Engineering, reviews the challenges of ever-increasing dataset volumes, and the role of the data engineer in working with data in the cloud.

Chapter 2, Data Management Architectures for Analytics, introduces foundational concepts and technologies related to big data processing.

Chapter 3, The AWS Data Engineer’s Toolkit, provides an introduction to a wide range of AWS services that are used for ingesting, processing, and consuming data, and orchestrating pipelines.

Chapter 4, Data Governance, Security, and Cataloging, covers the all-important topics of keeping data secure, ensuring good data governance, and the importance of cataloging your data.

Chapter 5, Architecting Data Engineering Pipelines, provides an approach for whiteboarding the high-level design of a data engineering pipeline.

Chapter 6, Ingesting Batch and Streaming Data, looks at the variety of data sources that we may need to ingest from, and examines AWS services for ingesting both batch and streaming data.

Chapter 7, Transforming Data to Optimize for Analytics, covers common transformations for optimizing datasets and for applying business logic.

Chapter 8, Identifying and Enabling Data Consumers, explores the different types of data consumers that a data engineer may need to prepare data for.

Chapter 9, A Deeper Dive into Data Marts and Amazon Redshift, focuses on the use of data warehouses as data marts and looks at moving data between a data lake and a data warehouse. This chapter also does a deep dive into Amazon Redshift, a cloud-based data warehouse.

Chapter 10, Orchestrating the Data Pipeline, looks at how various data engineering tasks and transformations can be put together in a data pipeline, and how these can be run and managed with pipeline orchestration tools such as AWS Step Functions.

Chapter 11, Ad Hoc Queries with Amazon Athena, does a deeper dive into the Amazon Athena service, which can be used to run SQL queries directly on data in the data lake, and beyond.

Chapter 12, Visualizing Data with Amazon QuickSight, discusses the importance of being able to craft visualizations of data, and how the Amazon QuickSight service enables this.

Chapter 13, Enabling Artificial Intelligence and Machine Learning, reviews how AI and ML are increasingly important for gaining new value from data, and introduces some of the AWS services for both ML and AI.

Chapter 14, Building Transactional Data Lakes, looks at new table formats (including Apache Iceberg, Apache Hudi, and Delta Lake) that bring traditional data warehouse-style features to data lakes.

Chapter 15, Implementing a Data Mesh Strategy, discusses a recent trend, referred to as a data mesh, that provides a new way to approach analytical data management and data sharing within an organization.

Chapter 16, Building a Modern Data Platform on AWS, introduces important concepts, such as DataOps, which provides automation and observability when building a modern data platform.

Chapter 17, Wrapping Up the First Part of Your Learning Journey, concludes the book by looking at the bigger picture of data analytics, including real-world examples of data pipelines, and a review of emerging trends in the industry.
