Scalable Data Architecture with Java
The first step of any implementation is always to understand the source data, because all our low-level transformation and cleansing logic depends on the variety of that data. In the previous chapter, we used DataCleaner to profile the data. This time, however, we are dealing with big data and the cloud, and DataCleaner may not be an effective profiling tool once the data size runs into terabytes. For our scenario, we will use a cloud-based data profiling tool from AWS called AWS Glue DataBrew.
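To make the idea of profiling concrete before we move to the cloud tooling, here is a minimal sketch of the kind of per-column summary a profiling step produces: row count, missing values, and distinct values. It is only an in-memory illustration under stated assumptions, not how AWS Glue DataBrew works internally. The class name `ColumnProfileSketch`, the column names `device_id` and `temperature`, and the sample rows are all hypothetical and are not taken from the book's sample file; the parser also assumes a simple CSV with no quoted commas.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of column profiling; not part of AWS Glue DataBrew.
final class ColumnProfileSketch {

    // Summary statistics for a single column.
    static final class ColumnProfile {
        final String name;
        final int rowCount;
        final int missingCount;
        final int distinctCount;

        ColumnProfile(String name, int rowCount, int missingCount, int distinctCount) {
            this.name = name;
            this.rowCount = rowCount;
            this.missingCount = missingCount;
            this.distinctCount = distinctCount;
        }
    }

    // Profiles one column of comma-separated lines held in memory.
    // The first line is treated as the header; blank cells count as missing.
    // Assumption: cells never contain quoted commas.
    static ColumnProfile profile(List<String> csvLines, String columnName) {
        String[] header = csvLines.get(0).split(",", -1);
        int index = -1;
        for (int i = 0; i < header.length; i++) {
            if (header[i].trim().equals(columnName)) {
                index = i;
                break;
            }
        }
        if (index < 0) {
            throw new IllegalArgumentException("Unknown column: " + columnName);
        }

        int rows = 0;
        int missing = 0;
        Set<String> distinct = new HashSet<>();
        for (String line : csvLines.subList(1, csvLines.size())) {
            String[] cells = line.split(",", -1);
            rows++;
            // A row that is shorter than the header is treated as missing the value.
            String value = index < cells.length ? cells[index].trim() : "";
            if (value.isEmpty()) {
                missing++;
            } else {
                distinct.add(value);
            }
        }
        return new ColumnProfile(columnName, rows, missing, distinct.size());
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, used only to exercise the sketch.
        List<String> sample = List.of(
                "device_id,temperature",
                "d1,21.5",
                "d2,",
                "d1,22.0");
        ColumnProfile p = profile(sample, "temperature");
        System.out.println(p.name + ": rows=" + p.rowCount
                + ", missing=" + p.missingCount
                + ", distinct=" + p.distinctCount);
    }
}
```

A sketch like this holds every distinct value in memory, which is workable for a small sample but not for terabytes of input. That limitation is the reason we turn to a managed, cloud-based profiling tool for this scenario.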
In this section, we will learn how to profile and analyze the incoming data to understand it (you can find the sample file for this on GitHub at https://github.com/PacktPublishing/Scalable-Data-Architecture-with-Java/tree/main/Chapter05). Follow these steps:
Create an S3 bucket called scalabledataarch using the AWS Management Console and upload the sample input data to the S3 bucket: