Simplify Big Data Analytics with Amazon EMR

By: Sakti Mishra

Overview of this book

Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS. This book is a practical guide to Amazon EMR for building data pipelines. You'll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERTs in an S3 data lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and plan the migration of on-premises Hadoop clusters to EMR. In addition to this, you'll explore best practices and cost optimization techniques for implementing your data analytics pipeline on EMR. By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and also migrate your existing on-premises Hadoop workloads to AWS.
Table of Contents (19 chapters)

Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR
Section 2: Configuration, Scaling, Data Security, and Governance
Section 3: Implementing Common Use Cases and Best Practices

Decoupling compute and storage

When you set up an EMR cluster for your batch or streaming workloads, you have the option to use the core nodes' HDFS as your primary distributed storage or Amazon S3 as your distributed storage layer. As you know, Amazon S3 provides a highly durable and scalable storage service, and Amazon EMR natively integrates with it.
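To make this concrete, here is a minimal PySpark sketch showing that the same DataFrame code can target either the core nodes' HDFS or Amazon S3 through its s3:// URIs. The bucket name and paths are placeholders for this illustration, not taken from the book:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-choice-demo").getOrCreate()

# Read input from the cluster's HDFS (lives only as long as the cluster) ...
df_hdfs = spark.read.parquet("hdfs:///data/input/orders/")

# ... or from Amazon S3, which survives cluster termination.
df_s3 = spark.read.parquet("s3://my-analytics-bucket/input/orders/")

# Write results to S3 so downstream jobs and other clusters can consume them.
df_s3.write.mode("overwrite").parquet("s3://my-analytics-bucket/curated/orders/")
```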

With Amazon S3 as the cluster's distributed storage, you can decouple compute and storage, which gives you additional flexibility. It enables you to run job-based transient clusters, where S3 acts as the permanent store and the core nodes' HDFS is used only for temporary storage. This way, each job can have its own cluster with the required amount of resources and scaling in place, and you avoid the cost of an always-on cluster.
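As an illustration of such a job-based transient cluster, the following boto3 sketch launches a cluster that runs a single Spark step against data in S3 and terminates itself once the step finishes. The bucket names, script path, release label, region, and IAM role names are assumptions for the example, not prescribed by the book:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-etl-transient",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-analytics-bucket/emr-logs/",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False => the cluster auto-terminates after its steps finish (transient).
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "spark-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-analytics-bucket/scripts/etl_job.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Launched transient cluster:", response["JobFlowId"])
```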

The following diagram represents how multiple transient EMR clusters, each containing various steps, can use S3 as their common persistent storage layer. This can also help with disaster recovery implementations:

Figure 1.3 – Multiple EMR clusters using Amazon S3 as their distributed storage

Now that you understand how EMR provides flexibility to decouple compute and storage, in the next section, you will learn how you can use this feature to create persistent or transient clusters depending on your use case.

Persistent versus transient clusters

Persistent clusters are always active to support multi-tenant workloads or interactive analytics. These clusters can have a constant node capacity or a minimal set of nodes with autoscaling enabled. Autoscaling is an EMR feature where EMR automatically scales the cluster up (adds nodes) or down (removes nodes) based on cluster utilization metrics. In later chapters, we will dive deep into EMR's scaling features and options.
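As a preview of one of those options, here is a hedged boto3 sketch that attaches an EMR managed scaling policy to an existing persistent cluster; the cluster ID and capacity limits are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",        # scale in units of EC2 instances
            "MinimumCapacityUnits": 3,      # keep a minimal baseline capacity
            "MaximumCapacityUnits": 20,     # cap the total scale-out
            "MaximumCoreCapacityUnits": 5,  # growth beyond this goes to task nodes
        }
    },
)
```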

Transient clusters are short-lived, job-based clusters. They are created when data arrives or through scheduled events, perform the data processing, write the output back to the target storage, and then terminate. They also start with a constant set of nodes and then scale to support additional workloads. With transient cluster workloads, Amazon S3 is ideally used as the persistent data store so that, after the cluster terminates, you still have access to the data for additional ETL or business intelligence reporting.

Here is a diagram that represents different kinds of cluster use cases you may have:

Figure 1.4 – EMR architecture representing cluster nodes

As you can see, all three clusters use Amazon S3 as their persistent storage layer, which decouples compute and storage. This lets you scale compute and storage independently: Amazon S3 provides virtually unlimited storage designed for 99.999999999% (11 9s) durability, while the cluster's compute capacity can scale horizontally by adding more core or task nodes.
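For example, you can resize compute without touching the data in S3. The following boto3 sketch grows a cluster's task instance group; the cluster ID and target count are placeholders, and it assumes the cluster actually has a task instance group:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Look up the task instance group of a running cluster.
groups = emr.list_instance_groups(ClusterId="j-XXXXXXXXXXXXX")["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

# Grow the task fleet; the data stored in S3 is unaffected by this change.
emr.modify_instance_groups(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceGroups=[{"InstanceGroupId": task_group["Id"], "InstanceCount": 10}],
)
```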

As represented in the diagram, transient clusters can be scheduled jobs or multiple workload-specific clusters running in parallel, each performing ETL on its own dataset with workload-specific cluster capacity.

When you implement transient clusters, a common best practice is to externalize your Hive Metastore, which means that when your cluster is terminated and a new one becomes active, it does not need to recreate the metastore or catalog tables. When externalizing the Hive Metastore of your EMR cluster, you have the option to use an Amazon RDS database or the AWS Glue Data Catalog as your metastore.
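As a hedged illustration of the Glue option, the following configuration classifications (passed at cluster creation, for example through the Configurations parameter of run_job_flow) point Hive and Spark SQL at the AWS Glue Data Catalog instead of a cluster-local metastore; the variable name is only for the example:

```python
# Configuration classifications that tell Hive and Spark SQL on EMR to use
# the AWS Glue Data Catalog as their external Hive Metastore.
glue_catalog_configurations = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    },
]
```

Because the catalog lives outside the cluster, any new transient cluster launched with these classifications immediately sees the same table definitions that previous clusters created.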