Exploring the modern data architecture on AWS
The modern data architecture is a data analytics architecture that combines the data lake and the data warehouse into a seamless system. This approach extends the traditional data warehouse approach and opens up new possibilities for data analytics. For this reason, it is important to understand this architecture, which can serve as a data backend for Amazon QuickSight or other BI applications. To understand the architecture better, let's start with the differences between a data lake and a data warehouse.
Data lakes versus data warehouses
Data lakes and data warehouses are both designed to ingest large amounts of data for analytics purposes. Data warehouses are traditional database systems that organizations and enterprises have used for years. Data lakes, on the other hand, are relatively young implementations that emerged from the big data and Hadoop ecosystems. Tables stored in data warehouses need to have clearly defined schemas, and the schema needs to be defined upfront, before any data is added. This approach is called schema on write, and it ensures that data conforms to a specific structure before being ingested into the data warehouse. However, it can be less flexible, and it may introduce complexity when dealing with evolving schemas. Evolving schemas are an increasingly common scenario because organizations need to capture more and more data points from their customer interactions to support data-driven decisions.
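To make schema on write concrete, here is a minimal Python sketch. It is not how any particular data warehouse is implemented; the `ORDERS_SCHEMA` table definition and the `validate_row` and `write_row` helpers are hypothetical names introduced only for illustration. The idea is simply that a row is checked against a schema declared upfront, and rejected if it does not conform.

```python
from datetime import date

# Hypothetical table schema, declared before any data is written:
# column name -> required Python type.
ORDERS_SCHEMA = {"order_id": int, "customer": str, "order_date": date, "amount": float}


def validate_row(row: dict, schema: dict) -> None:
    """Raise ValueError if the row does not match the schema exactly."""
    missing = schema.keys() - row.keys()
    extra = row.keys() - schema.keys()
    if missing or extra:
        raise ValueError(
            f"column mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    for column, expected_type in schema.items():
        if not isinstance(row[column], expected_type):
            raise ValueError(f"{column!r} must be {expected_type.__name__}")


def write_row(table: list, row: dict, schema: dict) -> None:
    """Append the row only after it passes validation (schema on write)."""
    validate_row(row, schema)
    table.append(row)
```

In this sketch, a row that carries a new `coupon` column would be rejected until the schema itself is changed, which is the kind of friction that evolving schemas introduce.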
By contrast, data lakes don't enforce a schema upfront. Instead, applications that have the required permissions can write data to a data lake. Structure and data formats aren't enforced by the data lake: they are the responsibility of the writing application.
Data stored in a data lake has few to no limitations regarding its format: it can be structured, semi-structured, or completely unstructured. For many datasets, a schema can be inferred, either because the data is semi-structured (CSV, JSON, and others), or because it follows patterns that can be identified by applying regular expressions and extracting specific columns. In data lakes, the schema is inferred when the data is read by the querying application. This approach is called schema on read, and it gives an organization flexibility regarding the types of data it stores. However, it also introduces challenges with data complexity and enforcing data quality.
For that reason, it is common for data that lands in the data lake to go through a series of transformations before it reaches a stage where it is usable. The first stage, often referred to as the raw layer, is where the data first lands, and it is stored as is.
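Schema on read can be sketched in a few lines of Python. This is an illustration under assumptions, not the behavior of a specific query engine: the `RAW_LINES` sample records and the `infer_schema` helper are hypothetical. The records are stored without any declared structure, and the reader derives the columns and types while scanning them.

```python
import json

# Hypothetical raw records as they might sit in a data lake file (JSON Lines).
# Nothing checked them when they were written.
RAW_LINES = [
    '{"order_id": 1, "amount": 9.5}',
    '{"order_id": 2, "amount": 12.0, "coupon": "SPRING"}',
]


def infer_schema(lines):
    """Return {column: sorted list of type names seen} by scanning the records."""
    seen = {}
    for line in lines:
        for column, value in json.loads(line).items():
            seen.setdefault(column, set()).add(type(value).__name__)
    return {column: sorted(types) for column, types in seen.items()}
```

Note that the second record adds a `coupon` field without any schema change. A column that ends up with more than one type name in the result is one example of the data quality issues that the reader has to handle.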
After the data has landed, the first series of transformations is applied and the data is stored in the processed layer. Since the data can be of any format, the types of possible transformations are limitless. To give just some examples, data quality functions can be applied at this stage to remove incomplete rows and to standardize the data to a specific datetime or time zone format. Other data engineering activities can also be performed at this stage, such as converting data into file formats optimized for analytics, or organizing files into folders using specific information (usually temporal) that can later be used as a partition column by the querying application.
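The processed-layer steps described above can be sketched as follows. This is a simplified, in-memory illustration: the `processed/orders` prefix, the `event_time` field, and both helper functions are hypothetical, and the `year=/month=/day=` folder naming is one common convention rather than a requirement. The sketch assumes every `event_time` is a timezone-aware `datetime`.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def partition_prefix(utc_time: datetime, base: str = "processed/orders") -> str:
    """Build a folder prefix that a query engine can use as partition columns."""
    return f"{base}/year={utc_time:%Y}/month={utc_time:%m}/day={utc_time:%d}/"


def process(rows, base: str = "processed/orders") -> dict:
    """Drop incomplete rows, convert timestamps to UTC, and group by partition."""
    partitions = defaultdict(list)
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # data quality rule: skip incomplete rows
        utc_time = row["event_time"].astimezone(timezone.utc)
        cleaned = {**row, "event_time": utc_time}
        partitions[partition_prefix(utc_time, base)].append(cleaned)
    return dict(partitions)
```

A row stamped 01:30 on 1 March 2023 at UTC+3 would land under `year=2023/month=02/day=28/`, because the timestamp is standardized to UTC before the partition is chosen.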
Finally, data can be converted for specific use cases and landed in the target layer. As an example, data can be transformed into the shape that a specific machine learning algorithm expects. Another use case could be BI applications, such as Amazon QuickSight, where data can be pre-joined or aggregated and therefore reduced from a large dataset into a smaller dataset that is easier to visualize. Additional data engineering can be applied at this stage to optimize for performance.
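As a small illustration of the BI use case, the following Python sketch pre-aggregates order rows into one row per day. The `order_date` and `amount` field names and the `daily_revenue` helper are assumptions made for this example; a real pipeline would typically do this in an ETL job rather than in memory.

```python
from collections import defaultdict


def daily_revenue(rows):
    """Reduce order-level rows to one aggregated row per day."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["order_date"]] += row["amount"]
    return [
        {"order_date": day, "revenue": total}
        for day, total in sorted(totals.items())
    ]
```

The output has as many rows as there are distinct days, regardless of how many orders went in, which is what makes it cheaper to visualize.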
Modern data architecture on AWS
On AWS, the modern data architecture builds on the following services:
- Amazon Redshift is the data warehouse service.
- Amazon S3 is the object store that can be used for cloud data lake storage.
- AWS Glue provides the Data Catalog, which stores technical metadata for the data lake.
AWS Glue Data Catalog tables can be exposed in Amazon Redshift as external tables, providing a unified metadata catalog across both the data warehouse and the S3 data lake.
Ability to query the data lake from the data warehouse
Redshift Spectrum is a feature of Redshift that allows you to run SQL queries against data in the S3 data lake. The queries are submitted directly from the data warehouse, so you don't need to connect to a different environment. You need to define the Spectrum tables as external tables in your data warehouse. The Redshift cluster also needs permission to access the data lake S3 location(s): it must be assigned an IAM role that has access to the desired S3 locations.
Another key characteristic of Redshift Spectrum is that Spectrum queries run on Spectrum nodes that sit outside the Redshift cluster. This effectively extends the Redshift cluster with additional compute capacity when data lake data needs to be queried.
Finally, Spectrum tables and Redshift tables can be combined and joined. Without this feature, you would have to move data and collocate it before joining it.
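The Spectrum setup described in this section can be sketched as a set of SQL statements, built here as Python strings so that they can be submitted through whichever SQL client you use. Everything specific in this sketch is a placeholder: the schema, database, table, and column names, the bucket path, and the role ARN are hypothetical. The helpers interpolate values without escaping, so they are only suitable for trusted inputs.

```python
def external_schema_sql(schema: str, glue_database: str, iam_role_arn: str) -> str:
    """Map an AWS Glue Data Catalog database to an external schema in Redshift."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


def external_table_sql(schema, table, columns, s3_location, stored_as="PARQUET"):
    """Define an external (Spectrum) table over files in the S3 data lake."""
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} ({column_list})\n"
        f"STORED AS {stored_as}\n"
        f"LOCATION '{s3_location}';"
    )


# Hypothetical query joining a Spectrum table (data lake) with a local
# Redshift table, without moving either dataset first.
JOIN_SQL = (
    "SELECT c.customer_name, SUM(s.amount) AS total_amount\n"
    "FROM spectrum.sales AS s\n"
    "JOIN public.customers AS c ON c.customer_id = s.customer_id\n"
    "GROUP BY c.customer_name;"
)
```

For example, `external_schema_sql("spectrum", "lake_db", "arn:aws:iam::123456789012:role/ExampleSpectrumRole")` produces the statement that registers the external schema, after which `spectrum.sales` can be queried like any other table.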
Ability to load data from the data lake
Redshift can efficiently load data from the S3 data lake. Specifically, Redshift's COPY command can load data in parallel from Amazon S3. At a minimum, you need to specify the table name, the data location (commonly S3), and the authorization to access the data in the source location. When loading multiple files from S3, Redshift parallelizes the loading by allocating each file to a Redshift slice (the unit of processing in Redshift).
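A minimal COPY statement with the three required pieces (table, location, authorization) can be sketched as a Python string builder. The table name, S3 prefix, and role ARN passed in are placeholders, and values are interpolated without escaping. The second helper is only a simplified round-robin illustration of spreading files across slices; it is an assumption made for teaching purposes and not Redshift's actual scheduling logic.

```python
def copy_sql(table: str, s3_prefix: str, iam_role_arn: str,
             file_format: str = "PARQUET") -> str:
    """Build a COPY statement: target table, source location, authorization."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {file_format};"
    )


def assign_files_to_slices(files, slice_count: int) -> dict:
    """Simplified illustration only: spread files over slices round-robin."""
    assignment = {slice_id: [] for slice_id in range(slice_count)}
    for index, name in enumerate(files):
        assignment[index % slice_count].append(name)
    return assignment
```

The illustration suggests why splitting a large dataset into multiple files helps: with a single file, only one slice in this toy model would have any work to do.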
Ability to unload data to the data lake
Redshift also comes with the ability to unload data from the data warehouse back to the data lake. Specifically, the UNLOAD command unloads the result of a query to Amazon S3. At a minimum, you need to specify the query, the S3 location, and the authorization. There are more options, such as defining the file format (using the FORMAT AS option) or applying partitioning (using the PARTITION BY option), and others.
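The UNLOAD options mentioned above can be put together in a short Python sketch that builds the statement as a string. The query, S3 prefix, role ARN, and partition column used with it are placeholders. Because UNLOAD takes the query as a quoted string, the helper doubles any single quotes inside the query; other values are interpolated as is, so the sketch assumes trusted inputs.

```python
def unload_sql(query: str, s3_prefix: str, iam_role_arn: str,
               file_format: str = "PARQUET", partition_by=()) -> str:
    """Build an UNLOAD statement with optional FORMAT AS and PARTITION BY."""
    escaped_query = query.replace("'", "''")  # quotes inside the quoted query
    lines = [
        f"UNLOAD ('{escaped_query}')",
        f"TO '{s3_prefix}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"FORMAT AS {file_format}",
    ]
    if partition_by:
        lines.append(f"PARTITION BY ({', '.join(partition_by)})")
    return "\n".join(lines) + ";"
```

Passing `partition_by=("order_date",)` adds a `PARTITION BY (order_date)` clause, so the unloaded files are organized into folders that a data lake query engine can later use as partitions.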
In the following diagram, we see an example data pipeline that uses both a data warehouse and a data lake on AWS. Data is loaded from the operational data stores into the Amazon S3 object store in the raw layer of the data lake. Then, through a set of ETL jobs, the data reaches a stage where it can be loaded into the data warehouse for BI purposes. For cost-effectiveness, you might not want to load all the data into the warehouse. Instead, you might want to leave the data in the data lake but keep the ability to query it when needed. This architecture considers the temperature of the data (how frequently the data is accessed) to determine the best storage. Hot data that needs to be accessed frequently is loaded into the data warehouse, while colder data remains in the data lake, a cheaper long-term storage option.
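The temperature-based placement can be expressed as a simple rule. The sketch below is only one possible heuristic: the access counts, the `hot_threshold` value, and the `choose_storage` helper are hypothetical, and a real decision would also weigh cost, latency, and data volume.

```python
def choose_storage(access_counts: dict, hot_threshold: int = 10) -> dict:
    """Place frequently accessed (hot) datasets in the warehouse, the rest in the lake."""
    return {
        dataset: "warehouse" if accesses >= hot_threshold else "lake"
        for dataset, accesses in access_counts.items()
    }
```

Here `access_counts` maps each dataset name to how often it was queried over some period, for example the last 30 days. Datasets routed to the lake remain queryable from the warehouse through external tables.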
Now that we have had an overview of the modern data architecture on AWS, let's build a basic one.