There is a reason why deploying Spark on Amazon EMR appears as one of the first recipes in this edition of the book. The majority of production Spark deployments happen on EMR (in fact, a majority of big data deployments in general, and increasingly so, happen on EMR). If you understand this recipe well, you can skip the rest of the recipes in this chapter, unless you are doing an on-premises deployment.
Note
Since this topic is of paramount importance in the current context, a lot more theory is provided here than a typical cookbook would have. You can skip the theory and go directly to the How to do it... section, but I encourage you not to do so.
What EMR represents is far more than meets the eye. Most enterprise workloads are migrating to public clouds at an accelerated pace. Once migrated, these workloads get rearchitected to leverage cloud-based services, as opposed to simply using the cloud as Infrastructure as a Service (IaaS). EC2 is AWS's IaaS compute service, while EMR is its leading Platform as a Service (PaaS) offering, with more big data workloads running on EMR than on all the alternatives combined.
Hadoop's core feature is data locality, that is, taking compute to where the data is. AWS disrupts this concept by separating storage and compute. AWS has multiple storage options, including the following:
- Amazon S3: S3 is general-purpose object storage.
- Amazon Redshift: This is a distributed cloud data warehouse.
- Amazon DynamoDB: This is a NoSQL database.
- Amazon Aurora: This is a cloud-based relational database.
Amazon S3 is the cheapest and most reliable cloud storage available, which makes it the first choice unless there is a compelling reason otherwise. EMR also supports attaching Elastic Block Store (EBS) volumes to the compute instances (EC2) to provide a lower-latency option.
Which option to choose depends upon what type of cluster is being created. There are two types of clusters:
- Persistent cluster: This runs 24 x 7, continuously analyzing data for use cases such as fraud detection in the financial industry or clickstream analytics in ad tech. For these purposes, HDFS mounted on EBS volumes is a good choice.
- Transient cluster: Here, workloads run intermittently, for example, genome sequencing or a holiday surge in retail. In this case, the cluster is spawned only when needed, making the EMR File System (EMRFS), which is backed by S3, a better choice.
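The persistent/transient distinction maps directly onto one flag in the EMR API. The following is a minimal sketch of the request body that boto3's `run_job_flow` expects; the cluster names, release label, and instance counts are illustrative, not prescriptive:

```python
# Sketch: request bodies for persistent vs transient EMR clusters,
# in the shape expected by boto3's emr_client.run_job_flow(**config).
# All names and sizes here are illustrative.

def cluster_config(name, transient):
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.8.0",           # example release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            # A transient cluster shuts down once all steps finish;
            # a persistent one keeps running until you terminate it.
            "KeepJobFlowAliveWhenNoSteps": not transient,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

persistent = cluster_config("fraud-detection-24x7", transient=False)
transient = cluster_config("genome-batch", transient=True)
```

With `KeepJobFlowAliveWhenNoSteps` set to `False`, you pay only for the minutes the steps actually run, which is what makes transient clusters attractive for bursty workloads.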
- Log in to https://aws.amazon.com with your credentials.
- Click on Services and select/search for EMR:
- Click on Create cluster and select the last option in the Applications option box:
- Click on Create cluster and the cluster will start as follows:
- Once the cluster is created with the given configuration, the My Cluster status will change to Waiting, as shown in the following screenshot:
- Now add a step that selects the JAR file; it takes the input file from the S3 location, produces the output, and stores it in the desired S3 bucket:
- The wordcount step's status will change to Completed, indicating successful completion of the step, as shown in the following screenshot:
- The output will be created and stored in the given S3 location. Here, it is in the output folder under the io.playground bucket:
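The wordcount step added through the console can equally be defined programmatically. Below is a minimal sketch of the step definition as boto3's `add_job_flow_steps` expects it; the JAR path and S3 locations mirror the io.playground example but are hypothetical:

```python
# Sketch: the custom-JAR wordcount step from the walkthrough, as the
# dict passed in the Steps list of boto3's
# emr_client.add_job_flow_steps(JobFlowId=..., Steps=[wordcount_step]).
# The JAR name and bucket layout are illustrative.

wordcount_step = {
    "Name": "wordcount",
    "ActionOnFailure": "CONTINUE",   # keep the cluster alive if the step fails
    "HadoopJarStep": {
        "Jar": "s3://io.playground/jars/wordcount.jar",  # hypothetical JAR location
        "Args": [
            "s3://io.playground/input/",   # input file location
            "s3://io.playground/output/",  # output folder (must not already exist)
        ],
    },
}
```

Note that, as in the console flow, the output folder must not exist before the step runs; Hadoop jobs fail rather than overwrite existing output.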
Let's look at the options shown in step 3:
- Cluster name: This is where you provide an appropriate name for the cluster.
- S3 folder: This is the S3 bucket location where this cluster's logs will go.
- Launch mode:
- Cluster: The cluster will continue to run until you terminate it.
- Step execution: This is to add steps after the application is launched.
- Software configuration:
- Vendor: This lets you choose between Amazon's EMR distribution, which bundles open source Hadoop, and MapR's distribution.
- Release: This is the EMR release version, which determines the versions of the bundled applications.
- Applications:
- Core Hadoop: This is focused on SQL-oriented interfaces such as Hive.
- HBase: This is focused on NoSQL-oriented workloads.
- Presto: This is focused on ad-hoc query processing.
- Spark: This is focused on Spark.
- Hardware configuration:
- Instance type: This topic will be covered in detail in the next section.
- Number of instances: This refers to the number of nodes in the cluster. One of them will be the master node and the rest will be slave nodes.
- Security and access:
- EC2 key pair: You can associate an EC2 key pair with the cluster that you can use to connect to it via SSH.
- Permissions: You can allow other users besides the default Hadoop user to submit jobs.
- EMR role: This allows EMR to call other AWS services, such as EC2, on your behalf.
- EC2 instance profile: This provides access to other AWS services, such as S3 and DynamoDB, via the EC2 instances that are launched by EMR.
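These security and access settings map onto fields of the EMR API request. The following is a minimal sketch of those fields in boto3's `run_job_flow` shape, assuming the AWS default roles (created with `aws emr create-default-roles`) and a hypothetical key pair name:

```python
# Sketch: security-related fields of an EMR run_job_flow request.
# The two role names are AWS's defaults; the key pair name is hypothetical.

security_fields = {
    "Instances": {
        "Ec2KeyName": "my-emr-keypair",    # EC2 key pair used for SSH access
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",  # EC2 instance profile: S3/DynamoDB access
    "ServiceRole": "EMR_DefaultRole",      # EMR role: calls EC2 on your behalf
    "VisibleToAllUsers": True,             # allow IAM users besides the creator to see the cluster
}
```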
EC2 instances are the most expensive part of a company's AWS bill, so selecting the right instance type is key to optimizing your bill. The following section is a quick overview of the different instance types. Instance types, both in the cloud and on premises, are defined by four factors:
- Number of cores
- Memory
- Storage (size and type)
- Network performance
To see a quick illustration of how these factors affect each other, visit http://youtube.com/infoobjects.
In the EC2 world, these factors are modified slightly: the number of physical cores is replaced by vCPUs, a virtualized unit of compute capacity. An instance type is therefore defined by:

- vCPUs
- Memory
- Storage (size and type)
- Network performance
Instance type families are defined by the ratio of these factors, especially vCPU to memory. In a given family, this ratio remains unchanged (T2 excluded). Different instance families serve different purposes, almost like different types of automobiles. In fact, we are going to use the automobile metaphor in this section to illustrate these families.
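The constant vCPU-to-memory ratio within a family can be verified directly from the (vCPUs, memory) columns of the tables below. A small illustration:

```python
# Illustration: memory-per-vCPU ratio, the defining characteristic of an
# instance family. Figures are taken from the instance tables in this section.

families = {
    "m4.large":  (2, 8),      # general purpose
    "m4.xlarge": (4, 16),
    "c4.large":  (2, 3.75),   # compute optimized
    "c4.xlarge": (4, 7.5),
    "r4.large":  (2, 15.25),  # memory optimized
    "r4.xlarge": (4, 30.5),
}

ratios = {name: mem / vcpus for name, (vcpus, mem) in families.items()}
# Within a family the ratio is constant: 4 GiB/vCPU for M4,
# 1.875 GiB/vCPU for C4, and 7.625 GiB/vCPU for R4.
```

The ratio, not the absolute size, is what tells you which family fits a workload: memory-hungry Spark caching jobs want a high ratio (R4), while CPU-bound transformations want a low one (C4).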
The T2 instance type is the gateway drug of the AWS world, the reason being that it belongs to the Free Tier: developers who sign up for AWS get this instance type free for up to a year. This tier has six subtypes:
| Instance Type | vCPUs | CPU Credits/Hr | Memory (GiB) |
| --- | --- | --- | --- |
| t2.micro | 1 | 6 | 1 |
| t2.small | 1 | 12 | 2 |
| t2.medium | 2 | 24 | 4 |
| t2.large | 2 | 36 | 8 |
| t2.xlarge | 4 | 54 | 16 |
| t2.2xlarge | 8 | 81 | 32 |
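The CPU credit column is what makes T2 "burstable": one CPU credit is one vCPU running at 100% for one minute, so dividing the hourly credit rate by 60 gives the baseline CPU share an instance can sustain indefinitely. A quick check against the table:

```python
# One CPU credit = one vCPU at 100% for one minute, so
# baseline CPU share = credits earned per hour / 60 minutes.

t2_credits_per_hour = {
    "t2.micro": 6,
    "t2.small": 12,
    "t2.medium": 24,
    "t2.large": 36,
}

baseline = {name: credits / 60 for name, credits in t2_credits_per_hour.items()}
# t2.micro can sustain 10% of one core; anything above that spends banked credits.
```

This is why T2 is fine for development and spiky, low-duty-cycle services, but a poor fit for a Spark cluster that keeps its CPUs busy: sustained load exhausts the credits and throttles the instance to its baseline.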
M4 is the general-purpose instance type to use when in doubt. This tier has six subtypes:
| Instance Type | vCPUs | Memory (GiB) | Dedicated Bandwidth |
| --- | --- | --- | --- |
| m4.large | 2 | 8 | 450 Mbps |
| m4.xlarge | 4 | 16 | 750 Mbps |
| m4.2xlarge | 8 | 32 | 1,000 Mbps |
| m4.4xlarge | 16 | 64 | 2,000 Mbps |
| m4.10xlarge | 40 | 160 | 4,000 Mbps |
| m4.16xlarge | 64 | 256 | 10,000 Mbps |
The C4 tier is compute optimized. It has five subtypes:
| Instance Type | vCPUs | Memory (GiB) | Dedicated Bandwidth |
| --- | --- | --- | --- |
| c4.large | 2 | 3.75 | 500 Mbps |
| c4.xlarge | 4 | 7.5 | 750 Mbps |
| c4.2xlarge | 8 | 15 | 1,000 Mbps |
| c4.4xlarge | 16 | 30 | 2,000 Mbps |
| c4.8xlarge | 36 | 60 | 4,000 Mbps |
The X1 tier is optimized for large-scale, in-memory workloads. It has two subtypes:
| Instance Type | vCPUs | Memory (GiB) | Dedicated Bandwidth |
| --- | --- | --- | --- |
| x1.16xlarge | 64 | 976 | 7,000 Mbps |
| x1.32xlarge | 128 | 1,952 | 14,000 Mbps |
The R4 tier is memory optimized. It has six subtypes:
| Instance Type | vCPUs | Memory (GiB) | Dedicated Bandwidth |
| --- | --- | --- | --- |
| r4.large | 2 | 15.25 | 10 Gbps |
| r4.xlarge | 4 | 30.5 | 10 Gbps |
| r4.2xlarge | 8 | 61 | 10 Gbps |
| r4.4xlarge | 16 | 122 | 10 Gbps |
| r4.8xlarge | 32 | 244 | 10 Gbps |
| r4.16xlarge | 64 | 488 | 20 Gbps |
The P2 tier is GPU optimized, intended for machine learning and other general-purpose GPU workloads. It has three subtypes:
| Instance Type | vCPUs | Memory (GiB) | GPUs | GPU Memory (GiB) |
| --- | --- | --- | --- | --- |
| p2.xlarge | 4 | 61 | 1 | 12 |
| p2.8xlarge | 32 | 488 | 8 | 96 |
| p2.16xlarge | 64 | 732 | 16 | 192 |
The I3 tier is storage optimized, with high-I/O NVMe SSD instance storage. It has six subtypes:
| Instance Type | vCPUs | Memory (GiB) | Storage (GB) |
| --- | --- | --- | --- |
| i3.large | 2 | 15.25 | 475 NVMe SSD |
| i3.xlarge | 4 | 30.5 | 950 NVMe SSD |
| i3.2xlarge | 8 | 61 | 1,900 NVMe SSD |
| i3.4xlarge | 16 | 122 | 2 x 1,900 NVMe SSD |
| i3.8xlarge | 32 | 244 | 4 x 1,900 NVMe SSD |
| i3.16xlarge | 64 | 488 | 8 x 1,900 NVMe SSD |
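Once you know your workload's vCPU and memory needs, choosing within a family reduces to picking the smallest subtype that fits. A minimal sketch, using figures from the R4 table above; the sizing numbers in the example call are illustrative:

```python
# Sketch: pick the smallest instance in a family that satisfies a
# resource requirement. The (name, vCPUs, memory GiB) tuples come from
# the R4 table; the example requirement is illustrative.

R4 = [
    ("r4.large", 2, 15.25), ("r4.xlarge", 4, 30.5), ("r4.2xlarge", 8, 61),
    ("r4.4xlarge", 16, 122), ("r4.8xlarge", 32, 244), ("r4.16xlarge", 64, 488),
]

def smallest_fit(instances, need_vcpus, need_mem_gib):
    # The list is ordered smallest to largest, so the first match wins.
    for name, vcpus, mem in instances:
        if vcpus >= need_vcpus and mem >= need_mem_gib:
            return name
    return None  # nothing in this family is big enough

# e.g. a Spark executor plan needing 8 cores and 48 GiB on one node:
choice = smallest_fit(R4, 8, 48)
```

The same table-driven selection works for any family; only the tuples change.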