Discovering the AWS analytics ecosystem
AWS provides a large number of analytics services. In addition to that, AWS has a number of partners who specialize in data analytics and offer analytics solutions that run on the AWS infrastructure. Partner solutions are not in the scope of this section, however. This section focuses on the AWS fully managed analytics services. In order to list the services, we will first define the specific categories related to analytics functions. Machine learning and predictive analytics are also out of the scope of this chapter. For every service category, we will then list the AWS services available, and for each service, we will provide a high-level description. Figure 1.1 depicts the commonly used AWS analytics services.
More and more organizations aspire to be data-driven and use data to drive their strategic decisions. Business intelligence (BI) tools help organizations to transform data into actionable insights. With the use of BI tools, users can analyze data and then present their findings in reports or dashboards. These reports or dashboards can then be consumed by business users who are interested in getting a picture of the state of the business.
In 2015, AWS launched Amazon QuickSight, a cloud-native BI tool. Since then, AWS has added new features to QuickSight, enriching the standard dashboard functionality with machine learning capabilities and offering embedded dashboard functionality. Amazon QuickSight is the main technology we will be covering in this book. Over the next few chapters, we will start with the basic functionality of Amazon QuickSight, and then we will explore more advanced features. Where possible, we will use practical examples that can be repeated in your own development environment, to give you hands-on experience with Amazon QuickSight.
Data warehouses are repositories of data; they are important components of the BI process. Data stored in data warehouses is typically structured. Traditionally, data is ingested and centralized into data warehouses from different operational data stores. Data warehouses are optimized to run analytical queries over large amounts of data. The results of analytical queries are usually calculated after an aggregation over multiple rows from one or more tables. BI applications use analytical queries to aggregate data and visualize it. It is a common architectural approach to use a data warehouse to serve data to a BI application.
Back in 2012, AWS launched Amazon Redshift, a cloud-native, fully managed data warehouse service. Today, Redshift is one of the most popular cloud data warehouses with thousands of organizations from different verticals using it to analyze their data. Other popular cloud data warehouses include Snowflake and Google BigQuery. Amazon Redshift integrates with most BI tools and it integrates natively with Amazon QuickSight. We will discuss this topic in more detail in Chapter 3, Preparing Data with Amazon QuickSight, when we look more closely into Amazon QuickSight-supported data sources.
Data lake storage and governance
A data lake is a repository of data where organizations can easily centralize all of their data and apply it in different use cases such as reporting, visualization, big data analytics, and predictive analytics. Data stored in data lakes can be structured or semi-structured. Usually, data is ingested into the data lake in its raw format, and is then transformed and stored back into the data lake for further processing and analysis. A cloud data lake typically uses a cloud object store to store data. AWS introduced Amazon Simple Storage Service (S3) in March 2006, offering developers a highly scalable, reliable, and low-latency data storage infrastructure at very low cost. Amazon S3 can store an unlimited amount of data, a particularly useful feature for data lakes. Organizations have one less thing to worry about because they don't need to think about scaling their storage as the amount of data stored grows.
While scaling data lake storage is something that organizations and CIOs don't need to worry about much anymore, data lake governance needs to be considered carefully. Data lakes do not enforce data schemas or data formats and, without any governance, data lakes can degrade into unusable data repositories, often referred to as data swamps. AWS offers a number of services for data governance.
The AWS Glue Catalog is part of the AWS Glue service. It is a fully managed Apache Hive metastore-compatible data catalog. Big data applications (for example, Apache Spark, Apache Hive, Presto, and so on) use the metadata in the catalog to locate and parse data. The AWS Glue Catalog is a technical metadata repository and can catalog data in Amazon S3, and a number of relational or non-relational data stores including Redshift, Aurora, and DynamoDB, among others.
AWS Lake Formation runs on top of AWS Glue and Amazon S3 and provides a governance layer and access layer for data lakes on Amazon S3. It also provides a set of reusable ETL jobs, called blueprints, that can be used to perform common ETL tasks (for example, loading data from a relational data store into an S3 data lake). Lake Formation allows users to manage access permissions, using a familiar GRANT REVOKE syntax that you might have seen in relational database management systems (RDBMSes).
Amazon Macie is an AWS service for data protection. It provides an inventory of Amazon S3 buckets and it uses machine learning to identify and alert its users about sensitive data, such as personally identifiable information (PII).
Finally, and perhaps most importantly, AWS Identity and Access Management (IAM) is a fundamental AWS service that allows users to assign permissions to principals (for example, users, groups, or roles) and explicitly allow or deny access to AWS resources including data lake locations or tables in the data catalog.
Ad hoc analytics
Ad hoc analytics refers to getting answers from the data on an as-needed basis. Contrary with what happens with scheduled reports, ad hoc querying is initiated by a user when they need to get specific answers from their data. The user typically uses SQL via a workbench type of application or other analytics frameworks (for instance, Apache Spark) using notebook environments or other BI applications. AWS has a number of analytics services that can be used for ad hoc analytics.
Amazon Redshift can be used for ad hoc analysis of data. For ad hoc querying, users will typically connect to Amazon Redshift using a query editor application with the Redshift JDBC/ODBC drivers. Notebook integrations or BI tool integrations are also possible for ad hoc analysis. AWS offers a number of managed notebook environments such as EMR notebooks and SageMaker notebooks. Amazon Redshift also allows its users to query data that is stored outside the data warehouse. Amazon Redshift Spectrum allows Redshift users to query data stored in Amazon S3, eliminating the need to load the data first before querying. Redshift's federated querying capability allows users to query live data in operational data stores such as PostgreSQL and MySQL.
For big data and data lakes, Presto is a popular choice for ad hoc analysis. Presto provides a high-performance parallel SQL query engine. Amazon Athena lets users run Presto queries in a scalable serverless environment. Amazon QuickSight natively supports Amazon Athena. We will talk more about this native integration in Chapter 3, Preparing Data with Amazon QuickSight. Amazon EMR is a fully managed Hadoop cluster, and it comes with a range of applications from the open source big data ecosystem. Presto has two community projects, PrestoSQL and PrestoDB, both of which are part of the Amazon EMR service. Other options included with EMR are Hive on EMR and Spark on EMR.
Extract, transform, load
ETL is a term used to describe a set of processes to extract, transform, and load data usually for analytical purposes. Organizations gather data from different data sources and centralize them in a central data repository. Data from different sources typically has different schemas and different conventions and standards, and therefore it can be challenging to combine them to get the required answers. For that reason, data needs to transformed so that it can work together. For example, cleaning the data, applying certain data quality thresholds, and standardizing to a specific standard (for instance, date and time formats used) are all important tasks to ensure the data is useable. A visual representation of the ETL process is shown in the following figure.
AWS Glue is a fully managed ETL service offered by AWS. When it was first introduced in 2017, Glue ETL offered an Apache Spark environment optimized for ETL. Now, Glue ETL offers a wider range of options:
- PySpark – Apache Spark using Python
- Spark with Scala – Apache Spark with Scala
- Python shell – For smaller ETL jobs that don't need a Spark cluster
- Glue Studio and Glue Databrew – Visual approach to ETL without the need to write code
Amazon EMR transient clusters, with applications such as Spark or Hive, can be leveraged for ETL workloads. ETL workloads can be bulk or streaming: streaming ETL workloads usually need to be up and running constantly, or at least for as long as the source stream is on; batch ETL workloads don't need to run at all times and they can stop once the data is loaded into the target system. This type of workload fits nicely with the flexibility of the cloud. With the cloud, data architects don't need to think of Hadoop clusters as big monolithic clusters. Instead, users prefer purpose-built transient clusters, optimized and sized to handle specific workloads and data loads.
Now that we've had our overview of the AWS analytics ecosystem, let's learn about modern data architecture and how they are built.