Data Engineering with AWS

By: Gareth Eagar

Overview of this book

Written by a Senior Data Architect with over twenty-five years of experience in the business, Data Engineering with AWS is a book whose sole aim is to make you proficient in using the AWS ecosystem. Using a thorough and hands-on approach to data, this book will give aspiring and new data engineers a solid theoretical and practical foundation to succeed with AWS. As you progress, you'll be taken through the services and the skills you need to architect and implement data pipelines on AWS. You'll begin by reviewing important data engineering concepts and some of the core AWS services that form a part of the data engineer's toolkit. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how the transformed data is used by various data consumers. You'll also learn about populating data marts and data warehouses, along with how a data lakehouse fits into the picture. Later, you'll be introduced to AWS tools for analyzing data, including those for ad hoc SQL queries and creating visualizations. In the final chapters, you'll see how machine learning and artificial intelligence can be used to draw new insights from data. By the end of this book, you'll be able to carry out data engineering tasks and implement a data pipeline on AWS independently.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Share Your Thoughts

Section 1: AWS Data Engineering Concepts and Trends

Chapter 1: An Introduction to Data Engineering

Technical requirements

The rise of big data as a corporate asset

The challenges of ever-growing datasets

Data engineers – the big data enablers

The benefits of the cloud when building big data analytic solutions

Hands-on – creating and accessing your AWS account

Summary

Chapter 2: Data Management Architectures for Analytics

Technical requirements

The evolution of data management for analytics

Understanding data warehouses and data marts – fountains of truth

Building data lakes to tame the variety and volume of big data

Bringing together the best of both worlds with the lake house architecture

Hands-on – configuring the AWS Command Line Interface tool and creating an S3 bucket

Summary

Chapter 3: The AWS Data Engineer's Toolkit

Technical requirements

AWS services for ingesting data

AWS services for transforming data

AWS services for orchestrating big data pipelines

AWS services for consuming data

Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket

Summary

Chapter 4: Data Cataloging, Security, and Governance

Technical requirements

Getting data security and governance right

Cataloging your data to avoid the data swamp

The AWS Glue/Lake Formation data catalog

AWS services for data encryption and security monitoring

AWS services for managing identity and permissions

Hands-on – configuring Lake Formation permissions

Summary

Section 2: Architecting and Implementing Data Lakes and Data Lake Houses

Chapter 5: Architecting Data Engineering Pipelines

Technical requirements

Approaching the data pipeline architecture

Identifying data consumers and understanding their requirements

Identifying data sources and ingesting data

Identifying data transformations and optimizations

Loading data into data marts

Wrapping up the whiteboarding session

Hands-on – architecting a sample pipeline

Summary

Chapter 6: Ingesting Batch and Streaming Data

Technical requirements

Understanding data sources

Ingesting data from a relational database

Ingesting streaming data

Hands-on – ingesting data with AWS DMS

Hands-on – ingesting streaming data

Summary

Chapter 7: Transforming Data to Optimize for Analytics

Technical requirements

Transformations – making raw data more valuable

Types of data transformation tools

Data preparation transformations

Business use case transforms

Working with change data capture (CDC) data

Hands-on – joining datasets with AWS Glue Studio

Summary

Chapter 8: Identifying and Enabling Data Consumers

Technical requirements

Understanding the impact of data democratization

Meeting the needs of business users with data visualization

Meeting the needs of data analysts with structured reporting

Meeting the needs of data scientists and ML models

Hands-on – creating data transformations with AWS Glue DataBrew

Summary

Chapter 9: Loading Data into a Data Mart

Technical requirements

Extending analytics with data warehouses/data marts

What not to do – anti-patterns for a data warehouse

Redshift architecture review and storage deep dive

Designing a high-performance data warehouse

Moving data between a data lake and Redshift

Hands-on – loading data into an Amazon Redshift cluster and running queries

Summary

Chapter 10: Orchestrating the Data Pipeline

Technical requirements

Understanding the core concepts for pipeline orchestration

Examining the options for orchestrating pipelines in AWS

Hands-on – orchestrating a data pipeline using AWS Step Functions

Summary

Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning

Chapter 11: Ad Hoc Queries with Amazon Athena

Technical requirements

Amazon Athena – in-place SQL analytics for the data lake

Tips and tricks to optimize Amazon Athena queries

Federating the queries of external data sources with Amazon Athena Query Federation

Managing governance and costs with Amazon Athena Workgroups

Hands-on – creating an Amazon Athena workgroup and configuring Athena settings

Hands-on – switching Workgroups and running queries

Summary

Chapter 12: Visualizing Data with Amazon QuickSight

Technical requirements

Representing data visually for maximum impact

Understanding Amazon QuickSight's core concepts

Ingesting and preparing data from a variety of sources

Creating and sharing visuals with QuickSight analyses and dashboards

Understanding QuickSight's advanced features – ML Insights and embedded dashboards

Hands-on – creating a simple QuickSight visualization

Summary

Chapter 13: Enabling Artificial Intelligence and Machine Learning

Technical requirements

Understanding the value of ML and AI for organizations

Exploring AWS services for ML

Exploring AWS services for AI

Hands-on – reviewing reviews with Amazon Comprehend

Summary

Designing a high-performance data warehouse

When you're looking to design a high-performing data warehouse, multiple factors need to be considered. These include items such as cluster type and sizing, compression types, distribution keys, sort keys, data types, and table constraints.
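To make a few of these factors concrete, here is a minimal sketch in Python that assembles a Redshift-style CREATE TABLE statement with per-column compression encodings, a distribution key, and a sort key. The table name, column names, chosen encodings, and the helper build_create_table are all hypothetical and used only for illustration; they are not taken from the book, and the right choices depend on your own data and query patterns.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Column:
    """One column of a hypothetical warehouse table."""
    name: str
    data_type: str
    encoding: Optional[str] = None  # compression encoding, e.g. "az64" or "zstd"
    not_null: bool = False          # simple table constraint


def build_create_table(
    table: str,
    columns: Sequence[Column],
    dist_key: Optional[str] = None,
    sort_keys: Sequence[str] = (),
) -> str:
    """Assemble a Redshift-style CREATE TABLE statement as a string."""
    names = {col.name for col in columns}
    # Distribution and sort keys must refer to columns that exist in the table.
    for key in ([dist_key] if dist_key else []) + list(sort_keys):
        if key not in names:
            raise ValueError(f"unknown key column: {key}")

    col_lines = []
    for col in columns:
        parts = [col.name, col.data_type]
        if col.encoding:
            parts.append(f"ENCODE {col.encoding}")
        if col.not_null:
            parts.append("NOT NULL")
        col_lines.append("    " + " ".join(parts))

    ddl = f"CREATE TABLE {table} (\n" + ",\n".join(col_lines) + "\n)"
    if dist_key:
        ddl += f"\nDISTSTYLE KEY\nDISTKEY ({dist_key})"
    if sort_keys:
        ddl += f"\nSORTKEY ({', '.join(sort_keys)})"
    return ddl + ";"


# Hypothetical example table: all names and encodings are assumptions.
example_ddl = build_create_table(
    "sales",
    [
        Column("sale_id", "BIGINT", encoding="az64", not_null=True),
        Column("customer_id", "BIGINT", encoding="az64", not_null=True),
        Column("sale_date", "DATE", not_null=True),
        Column("notes", "VARCHAR(256)", encoding="zstd"),
    ],
    dist_key="customer_id",
    sort_keys=["sale_date"],
)
print(example_ddl)
```

The sketch only generates the statement text; it does not connect to a cluster. In this example, rows would be distributed by customer_id and sorted by sale_date, which are the kinds of choices typically revisited as query patterns become clearer.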

As part of the design process, you will need to consider several trade-offs, such as cost versus performance or the size of storage versus performance. Business requirements and the available budget will often drive these decisions.

Beyond decisions about infrastructure and storage, the logical schema design also plays a big part in optimizing the performance of the data warehouse. Often, this will be an iterative process, where you start with an initial schema design that you refine over time to optimize for increased performance.

Selecting the optimal Redshift node type

There are different types of nodes available, each with different combinations of CPU, memory, storage capacity, and...
