Data Wrangling on AWS

By: Navnit Shukla, Sankar M, Sampat Palani

Overview of this book

Data wrangling is the process of cleaning, transforming, and organizing raw, messy, or unstructured data into a structured format. It involves steps such as data cleaning, data integration, data transformation, and data enrichment to ensure that the data is accurate, consistent, and suitable for analysis. Data Wrangling on AWS equips you with the knowledge to realize the full potential of AWS data wrangling tools. First, you’ll be introduced to data wrangling on AWS and become familiar with the data wrangling services available in AWS. You’ll understand how to work with AWS Glue DataBrew, AWS SDK for pandas (formerly AWS Data Wrangler), and Amazon SageMaker. Next, you’ll discover other AWS services such as Amazon S3, Amazon Redshift, Amazon Athena, and Amazon QuickSight. Additionally, you’ll explore advanced topics such as performing pandas data operations with AWS SDK for pandas, optimizing ML data with Amazon SageMaker, and building a data warehouse with AWS Glue DataBrew, along with security and monitoring aspects. By the end of this book, you’ll be well-equipped to perform data wrangling using AWS services.
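The four steps named in the overview (cleaning, integration, transformation, and enrichment) can be illustrated with a minimal sketch in plain pandas. This is not taken from the book: the tables, column names, and the `wrangle` helper are hypothetical, and the example runs locally without any AWS service.

```python
import pandas as pd

# Hypothetical raw records with typical defects: inconsistent casing,
# stray whitespace, a duplicated row, and a missing price.
raw_sales = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 3, 4],
        "city": [" seattle", "Austin ", "Austin ", "SEATTLE", "austin"],
        "price": ["25.0", "40.5", "40.5", None, "30.0"],
    }
)

# Hypothetical reference table used for the integration step.
city_regions = pd.DataFrame(
    {"city": ["Seattle", "Austin"], "region": ["West", "South"]}
)


def wrangle(sales: pd.DataFrame, regions: pd.DataFrame) -> pd.DataFrame:
    """Apply the four wrangling steps to a small sales table."""
    # Cleaning: drop exact duplicate rows and normalize the city names.
    cleaned = sales.drop_duplicates().copy()
    cleaned["city"] = cleaned["city"].str.strip().str.title()

    # Transformation: convert prices from text to numbers, then fill
    # missing values with the median of the known prices.
    cleaned["price"] = pd.to_numeric(cleaned["price"], errors="coerce")
    cleaned["price"] = cleaned["price"].fillna(cleaned["price"].median())

    # Integration: join the reference table on the cleaned city key.
    merged = cleaned.merge(regions, on="city", how="left")

    # Enrichment: derive a new column from existing values
    # (the 30.0 threshold is an arbitrary illustrative choice).
    merged["is_premium"] = merged["price"] >= 30.0
    return merged


tidy = wrangle(raw_sales, city_regions)
print(tidy)
```

Normalizing the join key before merging matters here: without the strip and title-case step, none of the city values would match the reference table and every `region` would be missing.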

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Share Your Thoughts

Download a free PDF copy of this book

Part 1: Unleashing Data Wrangling with AWS

Chapter 1: Getting Started with Data Wrangling

Introducing data wrangling

The steps involved in data wrangling

Best practices for data wrangling

Options available for data wrangling on AWS

Summary

Part 2: Data Wrangling with AWS Tools

Chapter 2: Introduction to AWS Glue DataBrew

Why AWS Glue DataBrew?

Getting started with AWS Glue DataBrew

Using AWS Glue DataBrew for data wrangling

Data protection with AWS Glue DataBrew

Data lineage and data publication

Summary

Chapter 3: Introducing AWS SDK for pandas

AWS SDK for pandas

Building blocks of AWS SDK for pandas

Customizing, building, and installing AWS SDK for pandas for different use cases

Configuration options for AWS SDK for pandas

The features of AWS SDK for pandas with different AWS services

Summary

Chapter 4: Introduction to SageMaker Data Wrangler

Data import

Data orchestration

Data transformation

Insights and data quality

Data analysis

Data export

SageMaker Studio setup prerequisites

Summary

Part 3: AWS Data Management and Analysis

Chapter 5: Working with Amazon S3

Challenges and considerations when building a data lake on Amazon S3

Summary

Chapter 6: Working with AWS Glue

What is Apache Spark?

Data discovery with AWS Glue

Data ingestion using AWS Glue ETL

Summary

Chapter 7: Working with Athena

Understanding Amazon Athena

Advanced data discovery and data structuring with Athena

Enriching data from multiple sources using Athena

Setting up a serverless data quality pipeline with Athena

Summary

Chapter 8: Working with QuickSight

Introducing Amazon QuickSight and its concepts

Data discovery with QuickSight

Data visualization with QuickSight

Summary

Part 4: Advanced Data Manipulation and ML Data Optimization

Chapter 9: Building an End-to-End Data-Wrangling Pipeline with AWS SDK for Pandas

A solution walkthrough for sportstickets.com

Data quality validation

Data visualization

Summary

Chapter 10: Data Processing for Machine Learning with SageMaker Data Wrangler

Technical requirements

Step 1 – logging in to SageMaker Studio

Step 2 – importing data

Exploratory data analysis

Step 4 – adding transformations

Step 5 – exporting data

Training a machine learning model

Summary

Part 5: Ensuring Data Lake Security and Monitoring

Chapter 11: Data Lake Security and Monitoring

Data lake security

Monitoring and auditing

Summary

Index

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Download a free PDF copy of this book

What this book covers

Chapter 1, Getting Started with Data Wrangling: In the opening chapter, you will embark on a journey into the world of data wrangling and discover the power of leveraging AWS for efficient and effective data manipulation and preparation. This chapter serves as a solid foundation, providing you with an overview of the key concepts and tools you’ll encounter throughout the book.

Chapter 2, Introduction to AWS Glue DataBrew: In this chapter, you will discover the powerful capabilities of AWS Glue DataBrew for data wrangling and data preparation tasks. This chapter will guide you through the process of leveraging AWS Glue DataBrew to cleanse, transform, and enrich your data, ensuring its quality and usability for further analysis.

Chapter 3, Introducing AWS SDK for pandas: In this chapter, you will be introduced to the versatile capabilities of AWS SDK for pandas (formerly AWS Data Wrangler) for data wrangling tasks on the AWS platform. This chapter will provide you with a comprehensive understanding of AWS SDK for pandas and how it can empower you to efficiently manipulate and prepare your data for analysis.

Chapter 4, Introduction to SageMaker Data Wrangler: In this chapter, you will discover the capabilities of Amazon SageMaker Data Wrangler for data wrangling tasks within the Amazon SageMaker ecosystem. This chapter will equip you with the knowledge and skills to leverage Amazon SageMaker Data Wrangler’s powerful features to efficiently preprocess and prepare your data for machine learning projects.

Chapter 5, Working with Amazon S3: In this chapter, you will delve into the world of Amazon Simple Storage Service (S3) and explore its vast potential for storing, organizing, and accessing your data. This chapter will provide you with a comprehensive understanding of Amazon S3 and how it can be leveraged for effective data management and manipulation.

Chapter 6, Working with AWS Glue: In this chapter, you will dive into the powerful capabilities of AWS Glue, a fully managed extract, transform, and load (ETL) service provided by AWS. This chapter will guide you through the process of leveraging AWS Glue to automate and streamline your data preparation and transformation workflows.

Chapter 7, Working with Athena: In this chapter, you will explore the powerful capabilities of Amazon Athena, a serverless query service that enables you to analyze data directly in Amazon S3 using standard SQL queries. This chapter will guide you through the process of leveraging Amazon Athena to unlock valuable insights from your data, without the need for complex data processing infrastructure.

Chapter 8, Working with QuickSight: In this chapter, you will discover the power of Amazon QuickSight, a fast, cloud-powered business intelligence (BI) service provided by AWS. This chapter will guide you through the process of leveraging QuickSight to create interactive dashboards and visualizations, enabling you to gain valuable insights from your data.

Chapter 9, Building an End-to-End Data-Wrangling Pipeline with AWS SDK for Pandas: In this chapter, you will explore the powerful combination of AWS SDK for pandas (formerly AWS Data Wrangler) and pandas, a popular Python library for data manipulation and analysis. This chapter will guide you through the process of using pandas operations together with AWS SDK for pandas to perform advanced data transformations and analysis on your datasets.

Chapter 10, Data Processing for Machine Learning with SageMaker Data Wrangler: In this chapter, you will delve into the world of machine learning (ML) data optimization using the powerful capabilities of Amazon SageMaker Data Wrangler. This chapter will guide you through the process of leveraging SageMaker Data Wrangler to preprocess and prepare your data for ML projects, maximizing the performance and accuracy of your ML models.

Chapter 11, Data Lake Security and Monitoring: In this chapter, you will be introduced to Identity and Access Management (IAM) on AWS and how closely Data Wrangler integrates with AWS security features. We will show how you can interact directly with Amazon CloudWatch Logs, run queries against the logs, and return the results as a data frame.
