Serverless Analytics with Amazon Athena

By: Anthony Virtuoso, Mert Turkay Hocanin, Aaron Wishnick

Overview of this book

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using SQL, without needing to manage any infrastructure. This book begins with an overview of the serverless analytics experience offered by Athena and teaches you how to build and tune an S3 data lake using Athena, including how to structure your tables using open-source file formats like Parquet. You’ll learn how to build, secure, and connect to a data lake with Athena and Lake Formation. Next, you’ll cover key tasks such as ad hoc data analysis, working with ETL pipelines, monitoring and alerting on KPI breaches using CloudWatch Metrics, running customizable connectors with AWS Lambda, and more. Moving on, you’ll work through easy integrations, troubleshooting and tuning common Athena issues, and the most common reasons for query failure. You will also review tips to help diagnose and correct failing queries in your pursuit of operational excellence. Finally, you’ll explore advanced concepts such as Athena Query Federation and Athena ML to generate powerful insights without needing to touch a single server. By the end of this book, you’ll be able to build and use a data lake with Amazon Athena to add data-driven features to your app and perform the kind of ad hoc data analysis that often precedes many of today’s ML modeling exercises.
Table of Contents (20 chapters)
Section 1: Fundamentals Of Amazon Athena
Section 2: Building and Connecting to Your Data Lake
Section 3: Using Amazon Athena
Chapter 11: Operational Excellence – Monitoring, Optimization, and Troubleshooting
Section 4: Advanced Topics

Obtaining and preparing sample data

Before we can run our first query, we need some data to analyze. Throughout this book, we will try to use open datasets that you can freely access and that contain interesting information that may mirror your real-world datasets. In this chapter, we will use the NYC Taxi & Limousine Commission's (TLC's) Trip Record Data for New York City's iconic yellow taxis. Yellow taxis have been recording and providing ride data to the TLC since 2009. Yellow taxis are traditionally hailed by signaling to a driver who is on duty and seeking a passenger (also known as a street hail). In recent years, yellow taxis have also started to use their own ride-hailing apps, such as Curb and Arro, to keep pace with emerging ride-hailing technologies from Uber and Lyft. However, yellow taxis remain the only vehicles permitted to respond to street hails from passengers in NYC. For that reason, the dataset often has interesting patterns that can be correlated with other events in the city, such as a concert or inclement weather.

Our exercise will focus on just one of the many datasets offered by the TLC. The yellow taxis data includes the following fields:

  • VendorID: A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
  • tpep_pickup_datetime: The date and time when the meter was engaged.
  • tpep_dropoff_datetime: The date and time when the meter was disengaged.
  • Passenger_count: The number of passengers in the vehicle.
  • Trip_distance: The elapsed trip distance in miles reported by the taximeter.
  • RateCodeID: The final rate code in effect at the end of the trip. 1= Standard rate, 2= JFK, 3= Newark, 4= Nassau or Westchester, 5= Negotiated fare, 6= Group ride.
  • Store_and_fwd_flag: This flag indicates whether the trip record was held in the vehicle's memory before being sent to the vendor, also known as "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip, while N= not a store and forward trip.
  • pulocationid: Location where the meter was engaged.
  • dolocationid: Location where the meter was disengaged.
  • Payment_type: A numeric code signifying how the passenger paid for the trip. 1= Credit card, 2= Cash, 3= No charge, 4= Dispute, 5= Unknown, 6= Voided trip.
  • Fare_amount: The time-and-distance fare calculated by the meter.
  • Extra: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
  • MTA_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
  • Improvement_surcharge: $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
  • Tip_amount: This field is automatically populated for credit card tips. Cash tips are not included.
  • Tolls_amount: Total amount of all tolls paid in a trip.
  • Total_amount: The total amount charged to passengers. Does not include cash tips.
  • congestion_surcharge: The amount of any surcharges associated with time/traffic fees imposed by the city.
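
The field names listed above mix capitalization styles (for example, VendorID, Passenger_count, and pulocationid). The following is a minimal sketch of one way to bring them to a single convention, assuming all-lowercase names are the target. That target is an assumption made for illustration, and the header string is typed from the list above, so the header in the downloaded file may differ in order or spelling:

```shell
#!/bin/sh
# Field names as listed in the text above (assumed order; the real file's
# header may differ in order or spelling).
header='VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,Passenger_count,Trip_distance,RateCodeID,Store_and_fwd_flag,pulocationid,dolocationid,Payment_type,Fare_amount,Extra,MTA_tax,Improvement_surcharge,Tip_amount,Tolls_amount,Total_amount,congestion_surcharge'

# Lowercase every name so that all fields follow one convention.
normalized=$(printf '%s\n' "$header" | tr '[:upper:]' '[:lower:]')

# Print one normalized field name per line.
printf '%s\n' "$normalized" | tr ',' '\n'
```

Lowercasing is the simplest possible rule and leaves the existing underscores untouched, so names such as VendorID become vendorid rather than vendor_id.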

This dataset is easy to obtain and is relatively interesting to run analytics against. The inconsistency in field naming, a mixture of camel case and underscore conventions, is difficult to overlook, but we will normalize it later. To prepare the data, follow these steps:

  1. Our first step is to download the Trip Record Data for June 2020. You can obtain this directly from the NYC TLC's website (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) or our GitHub repository using the following command:
    wget https://github.com/PacktPublishing/Serverless-Analytics-with-Amazon-Athena/raw/main/chapter_1/yellow_tripdata_2020-06.csv.gz

    If you choose to download it from the NYC TLC directly, please gzip the file before proceeding to the next step.

  2. Now that we have some data, we can add it to our data lake by uploading it to Amazon S3. To do this, we must create an S3 bucket. If you already have an S3 bucket that you plan to use, you can skip creating a new bucket. However, we do encourage you to avoid completing these exercises in accounts that house production workloads. As a best practice, all experimentation and learning should be done in isolation.
  3. Once you have picked a bucket name and the region that you would like to use for these exercises, you can run the following command:
    aws s3api create-bucket \
    --bucket packt-serverless-analytics \
    --region us-east-1

    Important Note

    Be sure to substitute your bucket name and region. If you choose a region other than us-east-1, you also need to add --create-bucket-configuration LocationConstraint=<your-region> to the command. You can also create buckets directly from the AWS Console by logging in and navigating to S3 from the service list. Later in this chapter, we will use the AWS Console to edit and run our Athena queries. For simple operations, the AWS CLI can be faster, and it makes it easier to see what is happening, since the AWS Console can hide multi-step operations behind a single button.

  4. Now that our bucket is ready, we can upload the data we would like to query. Within the bucket, we will put our data into a subfolder to keep things organized as we proceed through later exercises. We have an entire chapter dedicated to organizing and optimizing the layout of your data in S3. For now, let's just upload the data to a subfolder called tables/nyc_taxi using the following AWS CLI command. Be sure to replace the bucket name, packt-serverless-analytics, in the following example command with the name of your bucket:
    aws s3 cp ./yellow_tripdata_2020-06.csv.gz \
    s3://packt-serverless-analytics/tables/nyc_taxi/yellow_tripdata_2020-06.csv.gz

    This command may take a few moments to complete since it needs to upload our roughly 10 MB file over the internet to Amazon S3. If you get a permission error or a message about access being denied, double-check that you used the right bucket name and that your credentials are allowed to write to that bucket.

  5. If the command seems to have finished running without issue, you can use the following command to confirm the file is where we expect. Be sure to replace the example bucket with your actual bucket name:
    aws s3 ls s3://packt-serverless-analytics/tables/nyc_taxi/
  6. Now that we have confirmed our sample data is where we expect, we need to add this data to our Metastore, as described in the What is Amazon Athena? section. To do this, we will use AWS Glue Data Catalog as our Metastore by creating a database to house our table. Remember that Data Catalog will not store our data, just details about where engines such as Athena can find it (for example, S3) and what format was used to store the data (for example, CSV). Unlike Amazon S3 bucket names, which must be globally unique, database and table names only need to be unique within your own account and region, so you can use the following commands as-is, without the need to rename anything. If you already have a database that you would like to use, you can skip creating a new database, but be sure to substitute your database name into subsequent commands; otherwise, they will fail:
    aws glue create-database \
    --database-input "{\"Name\":\"packt_serverless_analytics\"}" \
    --region us-east-1
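
Since steps 3 to 6 all depend on the same bucket name, region, and database name, it can help to print the commands with your own values substituted before running anything. The following dry-run sketch does only that: it prints the AWS CLI commands and never calls AWS. The variable names BUCKET, REGION, DATABASE, and FILE and the function print_setup_commands are illustrative names rather than part of the AWS CLI. The sketch also adds the LocationConstraint setting that create-bucket requires for regions other than us-east-1:

```shell
#!/bin/sh
# Dry-run sketch: prints the AWS CLI commands for steps 3 to 6 without
# running them. All names below are illustrative; override them through
# the environment to match your own setup.
BUCKET="${BUCKET:-packt-serverless-analytics}"
REGION="${REGION:-us-east-1}"
DATABASE="${DATABASE:-packt_serverless_analytics}"
FILE="yellow_tripdata_2020-06.csv.gz"

print_setup_commands() {
  # Outside us-east-1, create-bucket needs an explicit LocationConstraint.
  location_arg=""
  if [ "$REGION" != "us-east-1" ]; then
    location_arg=" --create-bucket-configuration LocationConstraint=$REGION"
  fi

  # Step 3: create the bucket.
  printf '%s\n' "aws s3api create-bucket --bucket $BUCKET --region $REGION$location_arg"
  # Step 4: upload the sample file under tables/nyc_taxi/.
  printf '%s\n' "aws s3 cp ./$FILE s3://$BUCKET/tables/nyc_taxi/$FILE"
  # Step 5: confirm that the file landed where we expect.
  printf '%s\n' "aws s3 ls s3://$BUCKET/tables/nyc_taxi/"
  # Step 6: create the Glue Data Catalog database.
  printf '%s\n' "aws glue create-database --database-input '{\"Name\":\"$DATABASE\"}' --region $REGION"
}

print_setup_commands
```

For example, saving the sketch under a file name of your choice and running it with BUCKET and REGION set in the environment prints the four commands for that bucket and region, which you can review and then run by hand.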

Now that both our data and Metastore are ready, we can define our table right from Athena itself by running our first query.