Data Wrangling on AWS

By: Navnit Shukla, Sankar M, Sampat Palani
Overview of this book

Data wrangling is the process of cleaning, transforming, and organizing raw, messy, or unstructured data into a structured format. It involves steps such as data cleaning, data integration, data transformation, and data enrichment to ensure that the data is accurate, consistent, and suitable for analysis. Data Wrangling on AWS equips you with the knowledge to realize the full potential of AWS data wrangling tools. First, you'll be introduced to data wrangling on AWS and become familiar with the data wrangling services available in AWS. You'll understand how to work with AWS Glue DataBrew, AWS Data Wrangler, and Amazon SageMaker. Next, you'll discover other AWS services such as Amazon S3, Redshift, Athena, and QuickSight. Additionally, you'll explore advanced topics such as performing pandas data operations with AWS Data Wrangler, optimizing ML data with Amazon SageMaker, and building a data warehouse with Glue DataBrew, along with security and monitoring aspects. By the end of this book, you'll be well equipped to perform data wrangling using AWS services.
Table of Contents (19 chapters)
Part 1: Unleashing Data Wrangling with AWS
Part 2: Data Wrangling with AWS Tools
Part 3: AWS Data Management and Analysis
Part 4: Advanced Data Manipulation and ML Data Optimization
Part 5: Ensuring Data Lake Security and Monitoring

What is a data lake?

A data lake is a centralized repository that allows organizations to store all of their structured and unstructured data at any scale. It gives organizations a single, unified platform for storing and managing data from a variety of sources, including social media, sensors, and transactional systems.

Data lakes are designed to store large amounts of data in its raw format, so that various teams within the organization can process and analyze it at a later stage. Because the data does not have to be preprocessed or structured in any specific way before it is stored, organizations have the flexibility to collect data from a wide range of sources.
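
The idea of storing data raw and giving it structure only when it is read can be sketched in a few lines of Python. This is a minimal illustration, not how any AWS service works internally: a local temporary directory stands in for the lake (on AWS this role is typically played by an Amazon S3 bucket), and the `raw/<source>/` layout, the function names, and the `device` and `temp_c` fields are all hypothetical.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: str) -> Path:
    """Write a payload unchanged under raw/<source>/ (hypothetical layout)."""
    target = lake_root / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(payload, encoding="utf-8")
    return target


def read_readings(lake_root: Path) -> list[dict]:
    """Apply structure at read time: parse each raw file into one common shape."""
    rows = []
    for path in sorted((lake_root / "raw").rglob("*")):
        if path.suffix == ".jsonl":
            # One JSON object per line.
            for line in path.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    record = json.loads(line)
                    rows.append({"device": record["device"],
                                 "temp_c": float(record["temp_c"])})
        elif path.suffix == ".csv":
            # Header row followed by data rows.
            reader = csv.DictReader(io.StringIO(path.read_text(encoding="utf-8")))
            for record in reader:
                rows.append({"device": record["device"],
                             "temp_c": float(record["temp_c"])})
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        # Two sources deliver the same kind of data in different formats;
        # both are stored exactly as received.
        land_raw(lake, "sensors", "readings.jsonl",
                 '{"device": "a1", "temp_c": 21.5}\n')
        land_raw(lake, "legacy_export", "readings.csv",
                 "device,temp_c\nb2,19.0\n")
        for row in read_readings(lake):
            print(row)
```

Nothing is validated or reshaped when the files land; the common `device`/`temp_c` shape exists only in `read_readings`, so another team could read the same raw files with a different structure in mind.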

One of the key benefits of using a data lake is that it allows organizations to store and manage data from a variety of sources, including both structured and unstructured data. This means...