Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. Amazon EC2 provides the following features:
EC2 provides different instance types to meet a variety of compute needs, such as general-purpose instances, micro instances, memory-optimized instances, and storage-optimized instances. Micro instances are available in a free tier, so they are convenient for trying things out.
The spark-ec2 script comes bundled with Spark and makes it easy to launch, manage, and shut down clusters on Amazon EC2.
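Every spark-ec2 invocation follows the same general shape, where the main actions are launch, login, get-master, stop, start, and destroy:

```
$ spark-ec2 [options] <action> <cluster-name>
```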
Before you start, you need to do the following things:
Log in to the Amazon AWS account (http://aws.amazon.com).
Click on Security Credentials under your account name in the top-right corner.
Click on Access Keys and then on Create New Access Key.
Note down the access key ID and secret access key.
Now go to Services | EC2.
Click on Key Pairs in left-hand menu under NETWORK & SECURITY.
Click on Create Key Pair and enter kp-spark as the key-pair name.
Download the private key file and copy it to the /home/hduser/keypairs folder.
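SSH refuses to use a private key that other users can read, so it is worth restricting the permissions on the downloaded key file right away (the path below is the one used in this recipe):

```shell
# Make the private key readable and writable by the owner only;
# otherwise SSH rejects it with an "UNPROTECTED PRIVATE KEY FILE" warning
chmod 600 /home/hduser/keypairs/kp-spark.pem
```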
Set environment variables to reflect the access key ID and secret access key (please replace the sample values with your own):
$ echo "export AWS_ACCESS_KEY_ID=\"AKIAOD7M2LOWATFXFKQ\"" >> /home/hduser/.bashrc
$ echo "export AWS_SECRET_ACCESS_KEY=\"+Xr4UroVYJxiLiY8DLT4DLT4D4sxc3ijZGMx1D3pfZ2q\"" >> /home/hduser/.bashrc
$ echo "export PATH=\$PATH:/opt/infoobjects/spark/ec2" >> /home/hduser/.bashrc
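Before launching, it can save a failed run to verify that the credentials are actually visible in the current shell, since spark-ec2 reads them from the environment. A minimal guard might look like this (a sketch; the function name is our own):

```shell
# Fail early with a clear message if the AWS credentials are not set;
# spark-ec2 reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment
check_aws_credentials() {
  if [ -z "$AWS_ACCESS_KEY_ID" ] || [ -z "$AWS_SECRET_ACCESS_KEY" ]; then
    echo "AWS credentials are not set" >&2
    return 1
  fi
  echo "AWS credentials found"
}
check_aws_credentials
```

Remember to open a new shell (or run source /home/hduser/.bashrc) so the exports above take effect.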
Spark comes bundled with scripts to launch the Spark cluster on Amazon EC2. Let's launch the cluster using the following command:
$ cd /home/hduser
$ spark-ec2 -k <key-pair> -i <key-file> -s <num-slaves> launch <cluster-name>
Launch the cluster with the example values:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 -s 3 launch spark-cluster
Sometimes, the default availability zones are not available; in that case, retry the request, explicitly specifying the availability zone you want:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -z us-east-1b --hadoop-major-version 2 -s 3 launch spark-cluster
If your application needs to retain data after the instance shuts down, attach an EBS volume to it (for example, with 10 GB of space):
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 --ebs-vol-size 10 -s 3 launch spark-cluster
If you use Amazon spot instances, here's how to do it:
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --spot-price=0.15 --hadoop-major-version 2 -s 3 launch spark-cluster
Note
Spot instances allow you to name your own price for Amazon EC2 computing capacity. You simply bid on spare Amazon EC2 instances and run them whenever your bid exceeds the current spot price, which varies in real time based on supply and demand (source: amazon.com).
After everything is launched, check the status of the cluster by going to the web UI URL (the Spark master's web UI) that is printed at the end of the launch output.
Now, to access the Spark cluster on EC2, let's connect to the master node using secure shell protocol (SSH):
$ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem login spark-cluster
Check directories in the master node and see what they do:
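The layout below is typical of the spark-ec2 AMI (exact contents vary with the Spark and AMI versions):

```
$ ls
ephemeral-hdfs  hadoop-native  persistent-hdfs  scala  shark  spark  spark-ec2  tachyon
```

Roughly: ephemeral-hdfs is an HDFS installation backed by instance-local disks (its data is lost when instances are stopped), persistent-hdfs is backed by EBS volumes and survives restarts, spark is the Spark installation itself, and spark-ec2 holds the cluster-management helper scripts.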
Check the HDFS version in an ephemeral instance:
$ ephemeral-hdfs/bin/hadoop version
Hadoop 2.0.0-cdh4.2.0
Check the HDFS version in persistent instance with the following command:
$ persistent-hdfs/bin/hadoop version
Hadoop 2.0.0-cdh4.2.0
Change the log level in the configuration:
$ cd spark/conf
The default log level, INFO, is too verbose, so let's change it to ERROR:
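One way to make the change, sketched below under the assumption that the stock log4j.properties.template shipped with Spark is still unmodified, is to copy the template and rewrite the root logger line:

```shell
# Spark reads conf/log4j.properties if it exists; start from the bundled template
cp log4j.properties.template log4j.properties
# Lower the root logger from INFO to ERROR (pattern matches the stock template line)
sed -i 's/^log4j.rootCategory=INFO, console/log4j.rootCategory=ERROR, console/' log4j.properties
```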
Copy the configuration to all slave nodes after the change:
$ spark-ec2/copy-dir spark/conf
Destroy the Spark cluster:
$ spark-ec2 destroy spark-cluster