Book Image

Serverless Analytics with Amazon Athena

By : Anthony Virtuoso, Mert Turkay Hocanin, Aaron Wishnick
Book Image

Serverless Analytics with Amazon Athena

By: Anthony Virtuoso, Mert Turkay Hocanin, Aaron Wishnick

Overview of this book

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using SQL, without needing to manage any infrastructure. This book begins with an overview of the serverless analytics experience offered by Athena and teaches you how to build and tune an S3 Data Lake using Athena, including how to structure your tables using open-source file formats like Parquet. You’ll learn how to build, secure, and connect to a data lake with Athena and Lake Formation. Next, you’ll cover key tasks such as ad hoc data analysis, working with ETL pipelines, monitoring and alerting KPI breaches using CloudWatch Metrics, running customizable connectors with AWS Lambda, and more. Moving on, you’ll work through easy integrations, troubleshooting and tuning common Athena issues, and the most common reasons for query failure. You will also review tips to help diagnose and correct failing queries in your pursuit of operational excellence. Finally, you’ll explore advanced concepts such as Athena Query Federation and Athena ML to generate powerful insights without needing to touch a single server. By the end of this book, you’ll be able to build and use a data lake with Amazon Athena to add data-driven features to your app and perform the kind of ad hoc data analysis that often precedes many of today’s ML modeling exercises.
Table of Contents (20 chapters)
1
Section 1: Fundamentals Of Amazon Athena
5
Section 2: Building and Connecting to Your Data Lake
9
Section 3: Using Amazon Athena
14
Chapter 11: Operational Excellence – Monitoring, Optimization, and Troubleshooting
15
Section 4: Advanced Topics

Running your first query

Athena supports both Data Definition Language (DDL) and Data Manipulation Language (DML) queries. Queries where you SELECT data from a table are a common example of DML queries. Our first meaningful Athena query will be a DDL query that creates, or defines, our NYC Taxis data table:

  1. Let's begin by ensuring our AWS account and IAM user/role are ready to use Athena. To do that, navigate to the Athena query editor in the AWS Console: https://console.aws.amazon.com/athena/home.

    Be sure to use the same region that you uploaded your data and created your database in.

  2. If this is your first time using Athena, you will likely be met by a screen like the following. Luckily, Athena is telling us that "Before you run your first query, you need to set up a query result location in Amazon S3…". Since Athena writes the results of all queries to S3, even DDL queries, we will need to configure this setting before we can proceed. To do so, click on the highlighted text in the AWS Console that's shown in the following screenshot:
    Figure 1.1 – The prompt for setting the query result's location upon your first visit to Athena

    Figure 1.1 – The prompt for setting the query result's location upon your first visit to Athena

  3. After clicking on the modal's link, you will see the following prompt so that you can set your query result's location. You can use the same S3 bucket we used to upload our sample data, with results being used as the name of the folder that Athena will write query results to within that bucket. Be sure your location ends with a "/" to avoid errors:
Figure 1.2 – Athena's settings prompt for the query result's location

Figure 1.2 – Athena's settings prompt for the query result's location

Next, let's learn how to create a table.

Creating your first table

It is now time to run our first Athena query. The following DDL query asks Athena to create a new table called nyc_taxi in the packt_serverless_analytics database, which is stored in the AWS Glue Data Catalog. The query also specifies the schema (columns), file format, and storage location of the table. For now, the other nuances of this create query are unimportant. You may find it easier to copy create table from the create_nyc_taxi.sql (http://bit.ly/3mXj3K0) file in the chapter_1 folder of this book's GitHub repository. Paste it into Athena's query editor, change LOCATION so that it matches your bucket name, and click Run query. It should complete in a few seconds:

CREATE EXTERNAL TABLE 'packt_serverless_analytics'.'nyc_taxi'(
  'vendorid' bigint, 
  'tpep_pickup_datetime' string, 
  'tpep_dropoff_datetime' string, 
  'passenger_count' bigint, 
  'trip_distance' double, 
  'ratecodeid' bigint, 
  'store_and_fwd_flag' string, 
  'pulocationid' bigint, 
  'dolocationid' bigint, 
  'payment_type' bigint, 
  'fare_amount' double, 
  'extra' double, 
  'mta_tax' double, 
  'tip_amount' double, 
  'tolls_amount' double, 
  'improvement_surcharge' double, 
  'total_amount' double, 
  'congestion_surcharge' double)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<YOUR_BUCKET_NAME>/tables/nyc_taxi/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'columnsOrdered'='true', 
  'compressionType'='gzip', 
  'delimiter'=',',
  'skip.header.line.count'='1', 
  'typeOfData'='file')

Once your table creation DDL query completes, the left navigation pane of the Athena console will refresh with the definition of your new table. If you have other databases and tables, you may need to choose your database from the dropdown before your new table will appear.

Figure 1.3 – Athena's Database navigator will show the schema of your newly created table

Figure 1.3 – Athena's Database navigator will show the schema of your newly created table

At this point, the significance of the query we just ran may not be entirely apparent, but rest assured we will go deeper into why serverless DDL queries are a powerful thing. Oh, and did we mention that Athena does not charge for DDL queries?

Running your first analytics queries

When working with a new or unfamiliar set of data, it can be helpful to view a sample of the rows before exploring the dataset in more meaningful ways. This allows you to understand the schema of your dataset, including verifying that the schema (for example, column names) match the values and types. There are a few ways to do this, including the following limit query:

SELECT * from packt_serverless_analytics.nyc_taxi limit 100

This works fine in most cases, but we can do better. Many query engines, Athena included, will end up returning all 100 rows requested in the preceding query from the same S3 object. If your dataset contains many objects or files, you are getting an extremely narrow view of the table. For that reason, I prefer using the following query to view data from a broader portion of the dataset:

SELECT *
FROM packt_serverless_analytics.nyc_taxi TABLESAMPLE BERNOULLI (1) 
limit 100

This query is like the earlier limit query but uses Athena's TABLESAMPLE feature to obtain our 100 requested rows using BERNOULLI sampling. When a table is sampled using the Bernoulli method, all the objects of the table may be scanned as opposed to likely stopping after the first object. This is because the probability of a row being included in the result is independent of any other row reducing the significance of the object scan order. In the following screenshot, we can see some of the rows that were returned using TABLESAMPLE with the BERNOULLI method:

Figure 1.4 – Results of executing TAMPLESAMPLE against our nyc_taxi table

Figure 1.4 – Results of executing TAMPLESAMPLE against our nyc_taxi table

While that query allowed us to confirm that Athena can indeed access our data and that the schema appears to match the data itself, we have not extracted any real insights from the data. For this, we will run our first real analytics query by generating a histogram of ride durations and distances. Our goal here is to learn how much time people are typically spending in taxis, but we'll also be able to gain insights into the quality of our data. The following query uses Athena's numeric_histogram function to approximate the distribution with 10 buckets according to the difference between tpep_pickup_datetime and tpep_dropoff_datetime. Since the dataset stores datetimes as strings, we are using the date_parse function to convert the values into actual timestamps that we can then use with Athena's date_diff function to generate the ride durations as minutes. Lastly, the query uses a CROSS JOIN with UNEST to turn the histogram into rows and columns. Normally, the numeric_histogram function returns a map containing the histogram, but this can be difficult to read. UNEST helps us turn it into a more intuitive tabular format. Do not worry about remembering all these functions and SQL techniques right now. Athena frequently adds new capabilities, and you can always consult a reference.

You can copy the following code from GitHub at http://bit.ly/2Jm6o5v:

SELECT ride_minutes, number_rides
    FROM (SELECT numeric_histogram(10,
        date_diff('minute',
         date_parse(tpep_pickup_datetime,'%Y-%m-%d %H:%i:%s'),
         date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
         )
    )
FROM packt_serverless_analytics.nyc_taxi ) AS x (ride_histogram)
CROSS JOIN 
    UNNEST(ride_histogram) AS t (ride_minutes, number_rides);

Once you run the query, the results will look as follows. You can experiment with the number of buckets that are generated by adjusting the parameters of the numeric_histogram function. Generating 100 or even 1,000 buckets can uncover patterns that were hidden with fewer buckets. Even with just 10 buckets, we can already see a strong correlation between the distance and the number of rides. I was surprised to see that such a large portion of the yellow cab rides lasted less than 7 minutes. From this query, we can also see some likely data quality issues in the dataset. Unless one of the June 2020 rides happened in a time-traveling DeLorean, we likely have an erroneous record. Less obvious is the fact that several hundred rides claim to have lasted longer than 24 hours:

Figure 1.5 – Ride duration histogram results

Figure 1.5 – Ride duration histogram results

Let's try one more histogram query, but this time, we will target the trip distance of the rides that took less than 7 minutes. The following code block contains the modified histogram query you can run to understand that bucket of rides. You can download it from GitHub at http://bit.ly/3hkggJl:

SELECT trip_distance, number_rides
FROM 
    (SELECT numeric_histogram(5,trip_distance)
       FROM packt_serverless_analytics.nyc_taxi 
       WHERE date_diff('minute',
         date_parse(tpep_pickup_datetime,'%Y-%m-%d %H:%i:%s'),
         date_parse(tpep_dropoff_datetime, '%Y-%m-%d %H:%i:%s')
         ) <= 6.328061
    ) AS x (ride_histogram)
CROSS JOIN UNNEST(ride_histogram) AS t (trip_distance , number_rides);

Considering that the average person can walk a mile in 15 minutes, New Yorkers must be in a serious hurry to opt for taxi rides instead of a 15-minute walk!

Figure 1.6 – Ride distance histogram results

Figure 1.6 – Ride distance histogram results

With that, we've been through the basics of AWS Athena. Let's conclude by providing a recap of what we've learned.