Data Wrangling on AWS

By: Navnit Shukla, Sankar M, Sampat Palani
Overview of this book

Data wrangling is the process of cleaning, transforming, and organizing raw, messy, or unstructured data into a structured format. It involves processes such as data cleaning, data integration, data transformation, and data enrichment to ensure that the data is accurate, consistent, and suitable for analysis. Data Wrangling on AWS equips you with the knowledge to reap the full potential of AWS data wrangling tools. First, you’ll be introduced to data wrangling on AWS and familiarized with the data wrangling services available in AWS. You’ll understand how to work with AWS Glue DataBrew, AWS Data Wrangler, and Amazon SageMaker. Next, you’ll discover other AWS services such as Amazon S3, Redshift, Athena, and QuickSight. Additionally, you’ll explore advanced topics such as performing pandas data operations with AWS Data Wrangler, optimizing ML data with Amazon SageMaker, and building a data warehouse with Glue DataBrew, along with security and monitoring aspects. By the end of this book, you’ll be well-equipped to perform data wrangling using AWS services.
Table of Contents (19 chapters)

Part 1: Unleashing Data Wrangling with AWS
Part 2: Data Wrangling with AWS Tools
Part 3: AWS Data Management and Analysis
Part 4: Advanced Data Manipulation and ML Data Optimization
Part 5: Ensuring Data Lake Security and Monitoring

The steps involved in data wrangling

Similar to crude oil, raw data has to go through multiple data wrangling steps to become meaningful. In this section, we are going to learn the six-step process involved in data wrangling:

  1. Data discovery
  2. Data structuring
  3. Data cleaning
  4. Data enrichment
  5. Data validation
  6. Data publishing

Before we begin, it’s important to understand that these activities do not always need to be followed sequentially; in some cases, you may skip some of these steps altogether.

Also, keep in mind that these steps are iterative and differ for different personas, such as data analysts, data scientists, and data engineers.

As an example, data discovery for data engineers may vary from what data discovery means for a data analyst or data scientist:

Figure 1.5: The steps of the data-wrangling process

Let’s start learning about these steps in detail.

Data discovery

The first step of the data wrangling process is data discovery, and it is one of the most important. In data discovery, we familiarize ourselves with the raw data: what use case we are looking to solve with it, what relationships exist within it, what format it comes in (such as CSV or Parquet), what tools are available for storing, transforming, and querying it, and how we wish to organize it (for example, by folder structure, file size, or partitions) to make it easy to access.

Let’s understand this by looking at an example.

In this example, we will try to understand how data discovery varies based on the persona. Let’s assume we have two colleagues, James and Jean. James is a data engineer while Jean is a data analyst, and they both work for a car-selling company.

Jean is new to the organization, and she is required to analyze car sales numbers for Southern California. She has reached out to James and asked him for data from the sales table in the production system.

Here is the data discovery process for Jean (a data analyst):

  1. Jean has to identify the data she needs to generate the sales report (for example, sales transaction data, vehicle details data, customer data, and so on).
  2. Jean has to find where the sales data resides (a database, file share, CRM, and so on).
  3. Jean has to identify how much data she needs (from the last 12 months, the last month, and so on).
  4. Jean has to identify what kind of tool she is going to use (Amazon QuickSight, Power BI, and so on).
  5. Jean has to identify the format she needs the data to be in so that it works with the tools she has.
  6. Jean has to identify where she is looking to store this data: in a data lake (Amazon S3), on her desktop, a file share, a sandbox environment, and so on.

Here is the data discovery process for James (a data engineer):

  1. Which system holds the requested data? For example, Amazon RDS, Salesforce CRM, a production SFTP location, and so on.
  2. How will the data be extracted? For example, using services such as AWS DMS or AWS Glue, or by writing a script.
  3. What will the schedule look like? Daily, weekly, or monthly?
  4. What will the file format be? For example, CSV, Parquet, ORC, and so on (see the profiling sketch after this list).
  5. How will the data be stored in the provided store?
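Whichever persona performs it, a practical way to start discovery is to pull a small sample of the raw data and profile it. The following is a minimal sketch using AWS Data Wrangler (the awswrangler Python library) together with pandas; the bucket path and column inspection steps are illustrative assumptions, not values or steps prescribed by the book.

```python
import awswrangler as wr

# Hypothetical raw-data location; replace with your own bucket and prefix.
RAW_PATH = "s3://example-car-sales-raw/sales/"

# List the objects that make up the raw dataset to see how it is laid out.
files = wr.s3.list_objects(RAW_PATH)
print(f"Found {len(files)} files, for example: {files[:3]}")

# Read one file into a pandas DataFrame and profile it.
sample = wr.s3.read_csv(files[0])
print(sample.shape)           # number of rows and columns
print(sample.dtypes)          # inferred data types (schema)
print(sample.head())          # a few sample records
print(sample.isnull().sum())  # missing values per column
```

A quick profile like this answers several discovery questions at once: the schema, the approximate volume, and how clean (or messy) the source really is.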

Data structuring

To support existing and future business use cases and serve its customers better, an organization must collect unprecedented amounts of data from different data sources and in different varieties. In modern data architecture, the data is most often stored in data lakes, since a data lake allows you to store all kinds of data files, whether structured data, unstructured data, images, audio, video, or something else, and this data will be of different shapes and sizes in its raw form. When data is in its raw form, it lacks a definitive structure, which is required for it to be stored in databases or data warehouses or used to build analytics or machine learning models. At this point, it is not optimized for cost or performance.

In addition, when you work with streaming data such as clickstreams and log analytics, not all the data fields (columns) are used in analytics.

At this stage of data wrangling, we try to optimize the raw dataset for cost and performance benefits by partitioning it and converting file types (for example, CSV into Parquet).

Once again, let’s consider our friends James and Jean to understand this.

For Jean, the data analyst, data structuring means that she is looking to either query the data directly or store it in the in-memory store of a BI tool (in the case of Amazon QuickSight, this is called the SPICE layer), which provides faster access to the data.

For James, the data engineer, when he is extracting data from a production system and looking to store it in a data lake such as Amazon S3, he must consider what the file format will look like and how the data will be partitioned. He can partition it by geographical area, such as county, state, or region, or by date – for example, year=YYYY, month=MM, and day=DD.
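The following is a minimal sketch of this kind of structuring with AWS Data Wrangler: it reads the raw CSV data, converts it to Parquet, and writes it back to S3 partitioned by date. The bucket paths are hypothetical, and the sketch assumes the sales data already carries year, month, and day columns.

```python
import awswrangler as wr

# Hypothetical locations; replace with your own buckets.
RAW_PATH = "s3://example-car-sales-raw/sales/"
CURATED_PATH = "s3://example-car-sales-curated/sales/"

# Read all raw CSV files under the prefix into a pandas DataFrame.
df = wr.s3.read_csv(RAW_PATH)

# Write the data back as Parquet, partitioned by date. This produces
# year=YYYY/month=MM/day=DD folders under CURATED_PATH.
wr.s3.to_parquet(
    df=df,
    path=CURATED_PATH,
    dataset=True,
    partition_cols=["year", "month", "day"],
    mode="append",
)
```

Converting to a columnar format such as Parquet and partitioning by the columns most often used in filters is what delivers the cost and performance benefits described above.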

Data cleaning

The next step of the data wrangling process is data cleaning. The previous two steps give us an idea of how the data looks and how it is stored. In the data cleaning step, we start working with raw data to make it meaningful so that we can define future use cases.

In the data cleaning step, we try to make data meaningful by doing the following:

  • Removing unwanted columns and duplicate values, and filling in null values to improve the data’s readiness
  • Performing data validation, such as identifying missing values in mandatory columns (First Name, Last Name, SSN, Phone No., and so on)
  • Validating or fixing data types for better optimization of storage and performance
  • Identifying and fixing outliers
  • Removing garbage data or unwanted values, such as special characters

Both James and Jean can perform similar data cleaning tasks; however, the scale might vary. James must perform these tasks on the entire dataset, while Jean may only have to perform them on the data from Southern California. The granularity might vary as well: for James, it may be limited to regions such as Southern California and Northern California, while for Jean, it might be at the city or even ZIP code level.
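The following is a minimal pandas sketch of the cleaning tasks listed above; the column names (internal_notes, phone_no, price, sale_date, and so on) are hypothetical and only illustrate the pattern.

```python
import pandas as pd

def clean_sales_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop columns that are not needed downstream (hypothetical name).
    df = df.drop(columns=["internal_notes"], errors="ignore")

    # Remove duplicate records.
    df = df.drop_duplicates()

    # Drop records missing mandatory fields.
    df = df.dropna(subset=["first_name", "last_name", "phone_no"])

    # Fix data types for better storage and query performance.
    df["sale_date"] = pd.to_datetime(df["sale_date"], errors="coerce")
    df["price"] = pd.to_numeric(df["price"], errors="coerce")

    # Remove simple outliers: keep prices within the 1st-99th percentile.
    low, high = df["price"].quantile([0.01, 0.99])
    df = df[df["price"].between(low, high)]

    # Strip garbage characters from text columns.
    df["phone_no"] = df["phone_no"].str.replace(r"[^0-9+]", "", regex=True)

    return df
```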

Data enrichment

Up until the data cleaning step, we were primarily working on a single data source and making it meaningful for future use. However, in the real world, data is most often fragmented and stored in multiple disparate data stores, and to support use cases such as building personalization or recommendation solutions, Customer 360s, or log forensics, we need to join data from those different data stores.

For example, to build a Customer 360 solution, you need data from Customer Relationship Management (CRM) systems, clickstream logs, relational databases, and so on.

So, in the data enrichment step, we build the process that will enhance the raw data with relevant data obtained from different sources.
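As a minimal sketch, the enrichment itself often comes down to joining datasets on a shared key; the DataFrames and the customer_id key below are hypothetical stand-ins for CRM, clickstream, and sales data.

```python
import pandas as pd

def enrich_customer_view(crm: pd.DataFrame,
                         clickstream: pd.DataFrame,
                         sales: pd.DataFrame) -> pd.DataFrame:
    # Aggregate clickstream events per customer before joining.
    web_activity = (
        clickstream.groupby("customer_id")
        .agg(page_views=("page_url", "count"),
             last_visit=("event_time", "max"))
        .reset_index()
    )

    # Enrich CRM records with sales history and web activity
    # to build a simple Customer 360 view.
    customer_360 = (
        crm.merge(sales, on="customer_id", how="left")
           .merge(web_activity, on="customer_id", how="left")
    )
    return customer_360
```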

Data validation

There is a very interesting term in computer science called garbage in, garbage out (GIGO). GIGO is the concept that flawed or defective (garbage) input data produces defective output.

In other words, the quality of the output is determined by the quality of the input. So, if we provide bad data as input, we will get inaccurate results.

In the data validation step, we address this issue by performing various data quality checks:

  • Validating data accuracy against business rules
  • Validating data security
  • Validating result consistency across the entire dataset
  • Validating data quality through checks such as the following:
    • Number of records
    • Duplicate values
    • Missing values
    • Outliers
    • Distinct values
    • Unique values
    • Correlation

There is a lot of overlap, and many similarities, between data cleaning and data validation. However, data validation is performed on the resulting dataset, while data cleaning is primarily performed on the raw dataset.
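A minimal sketch of such checks in pandas is shown below; the thresholds and the customer_id column are hypothetical, and in practice, you might run these checks with a dedicated tool instead of hand-written code.

```python
import pandas as pd

def validate_dataset(df: pd.DataFrame, expected_min_rows: int = 1) -> dict:
    # Collect simple data quality metrics for the wrangled dataset.
    report = {
        "record_count": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": df.isnull().sum().to_dict(),
        "distinct_customers": int(df["customer_id"].nunique()),
    }

    # Example validations; failures raise so a pipeline run can stop early.
    assert report["record_count"] >= expected_min_rows, "Too few records"
    assert report["duplicate_rows"] == 0, "Duplicate records found"
    return report
```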

Data publishing

After completing all the data wrangling steps, the data is ready to be used for analytics so that it can solve business problems.

So, the final step is to publish the data to the end user with the required access and permission.

In this step, we primarily concentrate on how the data is being exposed to the end user and where the final data gets stored – that is, in a relational database, a data warehouse, curated or user zones in a data lake, or through the Secure File Transfer Protocol (SFTP).

The choice of data storage depends on the tool through which the end user is looking to access the data. For example, if the end user is looking to access data through BI tools such as Amazon QuickSight, Power BI, Informatica, and so on, a relational data store will be an ideal choice. If it is accessed by a data scientist, ideally, it should be stored in an object store.
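As a minimal sketch, AWS Data Wrangler can publish a curated dataset to the data lake and register it in the AWS Glue Data Catalog so that tools such as Amazon Athena or QuickSight can query it; the bucket, database, and table names below are hypothetical.

```python
import awswrangler as wr
import pandas as pd

def publish_dataset(curated_df: pd.DataFrame) -> None:
    # Write the wrangled data to the curated zone of the data lake and
    # register it in the Glue Data Catalog so BI tools can query it.
    wr.s3.to_parquet(
        df=curated_df,
        path="s3://example-car-sales-curated/sales_report/",
        dataset=True,
        database="car_sales_analytics",       # Glue Data Catalog database
        table="sales_southern_california",
        mode="overwrite",
    )
```

If the end users need a relational store instead, the same DataFrame could be loaded into a data warehouse such as Amazon Redshift rather than published to the data lake.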

We will learn about the different kinds of data stores we can use to store raw and wrangled data later in this book.

In this section, we learned about the various steps of the data wrangling process through our friends James and Jean and how these steps may or may not vary based on personas. Now, let’s understand the best practices for data wrangling.