Book Image

Serverless Analytics with Amazon Athena

By : Anthony Virtuoso, Mert Turkay Hocanin, Aaron Wishnick
Book Image

Serverless Analytics with Amazon Athena

By: Anthony Virtuoso, Mert Turkay Hocanin, Aaron Wishnick

Overview of this book

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using SQL, without needing to manage any infrastructure. This book begins with an overview of the serverless analytics experience offered by Athena and teaches you how to build and tune an S3 Data Lake using Athena, including how to structure your tables using open-source file formats like Parquet. You’ll learn how to build, secure, and connect to a data lake with Athena and Lake Formation. Next, you’ll cover key tasks such as ad hoc data analysis, working with ETL pipelines, monitoring and alerting KPI breaches using CloudWatch Metrics, running customizable connectors with AWS Lambda, and more. Moving on, you’ll work through easy integrations, troubleshooting and tuning common Athena issues, and the most common reasons for query failure. You will also review tips to help diagnose and correct failing queries in your pursuit of operational excellence. Finally, you’ll explore advanced concepts such as Athena Query Federation and Athena ML to generate powerful insights without needing to touch a single server. By the end of this book, you’ll be able to build and use a data lake with Amazon Athena to add data-driven features to your app and perform the kind of ad hoc data analysis that often precedes many of today’s ML modeling exercises.
Table of Contents (20 chapters)
1
Section 1: Fundamentals Of Amazon Athena
5
Section 2: Building and Connecting to Your Data Lake
9
Section 3: Using Amazon Athena
14
Chapter 11: Operational Excellence – Monitoring, Optimization, and Troubleshooting
15
Section 4: Advanced Topics

What is Amazon Athena?

Amazon Athena is a query service that allows you to run standard SQL over data stored in a variety of sources and formats. As you will see later in this chapter, Athena is serverless, so there is no infrastructure to set up or manage. You simply pay $5 per TB scanned for the queries you run without needing to worry about idle resources or scaling.

Note

AWS has a habit of reducing prices over time. For the latest Athena pricing, please consult the Amazon Athena product page at https://aws.amazon.com/athena/pricing/?nc=sn&loc=3.

Athena is based on Presto (https://prestodb.io/), a distributed SQL engine that's open sourced by Facebook. It supports ANSI SQL, as well as Presto SQL features ranging from geospatial functions to rough query extensions, which allow you to run approximating queries, with statistically bound errors, over large datasets in only a fraction of the time. Athena's commitment to open source also provides an interesting avenue to avoid lock-in concerns because you always have the option to download and manage your own Presto deployment from GitHub. Of course, you will lose many of Athena's enhancements and must manage the infrastructure yourself, but you can take comfort in knowing you are not beholden to potentially punitive licensing agreements as you might be with other vendors.

While Athena's roots are open source, the team at AWS have added several enterprise features to the service, including the following:

  • Federated Identity via SAML and Active Directory support
  • Table, column, and even row-level access control via Lake Formation
  • Workload classification and grouping for cost control via WorkGroups
  • Automated regression testing to take the pain out of upgrades

Later chapters will cover these topics in greater detail. If you feel compelled to do so, you can use the table of contents to skip directly to those chapters and learn more.

Let's look at some use cases for Athena.

Use cases

Amazon Athena supports a wide range of use cases and we have personally used it for several different patterns. Thanks to Athena's ease of use, it is extremely common to leverage Athena for ad hoc analysis and data exploration.

Later in this book, you will use Athena from within a Jupyter notebook for machine learning. Similarly, many analysts enjoy using Athena directly from BI Tools such as Looker and Tableau, courtesy of Athena's JDBC driver. Athena's robust SQL dialect and asynchronous API model also enables application developers to build analytics right into their applications, enabling features that would not previously have been practical due to scale or operational burden. In many cases, you can replace RDBMS-driven features with Athena at a fraction of the cost and lower operational burden.

Another emerging use case for Athena is in the ETL space. While Athena advertises itself as being an engine that avoids the need for ETL by being able to query the data in place, as it is, we have seen the benefits of replacing existing or building new ETL pipelines using Athena where cost and capacity management are key factors. Athena will not necessarily achieve the same scale or performance as Spark, for example, but if your ETL jobs do not require multi-TB joins, you might find Athena to be an interesting option.

Separation of storage and compute

If you are new to serverless analytics, you may be wondering where your data is stored. Amazon Athena builds on the concept of Separation of Storage and Compute to decouple the computational resources (for example, CPU, memory, network) that do the heavy lifting of executing your SQL queries from the responsibility of keeping your data safe and available. In short, this means Athena itself does not store your data. Instead, you are free to choose from several data stores with customers increasingly pairing with DynamoDB to rapidly mutate data with S3 for their bulk data. With Athena, you can easily write a query that spans both data stores.

Amazon's Simple Storage Service, or S3 for short, is easily the most recommended data store to use with Athena. When Athena launched in 2016, S3 was the first data store it supported. Unsurprisingly, Athena has been optimized to take advantage of S3's unique ability to deliver exabyte scale and throughput while still providing eleven nines (99.999999999%) of durability. In addition to effortless scaling from a few gigabytes of data up to many petabytes, S3 offers some of the lowest prices for performance that you can find. Depending on your replication requirements, storing 1 GB of data for a month will cost you between $0.01 and $0.023. Even the most cost-efficient enterprise hard drives cost around $0.21 per GB before you add on redundancy, the power to run them, or a server and data center to house them. As with most AWS services, you should consult S3's pricing page (https://aws.amazon.com/s3/pricing/) for the latest details since AWS has cut their prices more than 70 times in the last decade.

Metastore

In addition to accessing the raw 1s and 0s that represent your data, Athena also requires metadata that helps its SQL engine understand how to interpret the data you have stored in S3 or elsewhere. This supplemental information helps Athena map collections of files, or objects in the case of S3, to SQL constructs such as tables, columns, and rows. The repository for this data, about your data, is often called a metastore. Athena works with Hive-compliant metastores, including AWS's Glue Data Catalog service. In later chapters, we will look at AWS Glue Data Catalog in more detail, as well as how you can attach Athena to your own metastore, even a homegrown one. For now, all you need to know is that Athena requires the use of a metastore to discover key attributes of the data you wish to query. The most common pieces of information that are kept in the Metastore include the following:

  • A list of tables that exist
  • The storage location of each table (for example, the S3 path or DynamoDB table name)
  • The format of the files or objects that comprise the table (for example, CSV, Parquet, JSON)
  • The column names and data types in each table (for example, inventory column is an integer, while revenue is a decimal (10,2))

Now that we have a good overview of Amazon Athena, let's look at how to use it in practice.