Launching Spark on Amazon EC2


Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute instances in the cloud. Amazon EC2 provides the following features:

  • On-demand delivery of IT resources via the Internet

  • The provision of as many instances as you like

  • Payment for the hours you use instances, much like a utility bill

  • No setup cost, no installation, and no overhead at all

  • When you no longer need instances, you can simply shut them down or terminate them and walk away

  • The availability of these instances on all familiar operating systems

EC2 provides different types of instances to meet various compute needs, such as general-purpose instances, micro instances, memory-optimized instances, and storage-optimized instances. A free tier of micro instances is available to try the service out.

Getting ready

The spark-ec2 script comes bundled with Spark and makes it easy to launch, manage, and shut down clusters on Amazon EC2.

Before you start, you need to do the following things:

  1. Log in to your Amazon AWS account (http://aws.amazon.com).

  2. Click on Security Credentials under your account name in the top-right corner.

  3. Click on Access Keys and Create New Access Key:

  4. Note down the access key ID and secret access key.

  5. Now go to Services | EC2.

  6. Click on Key Pairs in the left-hand menu under NETWORK & SECURITY.

  7. Click on Create Key Pair and enter kp-spark as the key-pair name:

  8. Download the private key file and copy it to the /home/hduser/keypairs folder.

  9. Set the permissions on the key file to 600.
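
    Assuming the private key was saved as kp-spark.pem in the folder above, this can be done with chmod:

    $ chmod 600 /home/hduser/keypairs/kp-spark.pem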

  10. Set environment variables to reflect the access key ID and secret access key (replace the sample values with your own):

    $ echo "export AWS_ACCESS_KEY_ID=\"AKIAOD7M2LOWATFXFKQ\"" >> /home/hduser/.bashrc
    $ echo "export AWS_SECRET_ACCESS_KEY=\"+Xr4UroVYJxiLiY8DLT4DLT4D4sxc3ijZGMx1D3pfZ2q\"" >> /home/hduser/.bashrc
    $ echo "export PATH=$PATH:/opt/infoobjects/spark/ec2" >> /home/hduser/.bashrc
    

How to do it...

  1. Spark comes bundled with scripts to launch the Spark cluster on Amazon EC2. Let's launch the cluster using the following command:

    $ cd /home/hduser
    $ spark-ec2 -k <key-pair> -i <key-file> -s <num-slaves> launch <cluster-name>
    
  2. Launch the cluster with example values:

    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2  -s 3 launch spark-cluster
    

    Note

    • <key-pair>: This is the name of the EC2 key pair created in AWS

    • <key-file>: This is the private key file you downloaded

    • <num-slaves>: This is the number of slave nodes to launch

    • <cluster-name>: This is the name of the cluster

  3. Sometimes the default availability zones are not available; in that case, retry the request by specifying the specific availability zone you need:

    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -z us-east-1b --hadoop-major-version 2  -s 3 launch spark-cluster
    
  4. If your application needs to retain data after the instance shuts down, attach an EBS volume to it (for example, 10 GB of space):

    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 --ebs-vol-size 10 -s 3 launch spark-cluster
    
  5. If you want to use Amazon spot instances, here's the way to do it:

    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --spot-price=0.15 --hadoop-major-version 2 -s 3 launch spark-cluster
    

    Note

    Spot instances allow you to name your own price for Amazon EC2 computing capacity. You simply bid on spare Amazon EC2 instances and run them whenever your bid exceeds the current spot price, which varies in real-time based on supply and demand (source: amazon.com).

  6. After everything is launched, check the status of the cluster by going to the web UI URL that will be printed at the end.
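
    If you missed the URL, the script can print the master's hostname again; a minimal sketch using the get-master action (the status web UI is typically served on port 8080 of that host):

    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem get-master spark-cluster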

  7. Check the status of the cluster:

  8. Now, to access the Spark cluster on EC2, let's connect to the master node using the secure shell protocol (SSH):

    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem login spark-cluster
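
    The login action is a thin wrapper around SSH; a rough equivalent, assuming <master-public-dns> is the master's public hostname (spark-ec2 connects as the root user by default), would be:

    $ ssh -i /home/hduser/keypairs/kp-spark.pem root@<master-public-dns>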
    

    You should get something like the following:

  9. Check the directories in the master node and see what they do:

    Directory         Description
    ephemeral-hdfs    This is the Hadoop instance for which data is ephemeral and gets deleted when you stop or restart the machine.
    persistent-hdfs   Each node has a very small amount of persistent storage (approximately 3 GB). If you use this instance, data will be retained in that space.
    hadoop-native     These are native libraries to support Hadoop, such as the snappy compression libraries.
    scala             This is the Scala installation.
    shark             This is the Shark installation (Shark is no longer supported and has been replaced by Spark SQL).
    spark             This is the Spark installation.
    spark-ec2         These are files to support this cluster deployment.
    tachyon           This is the Tachyon installation.
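
    A quick way to see these directories is to list the home directory once you are logged in to the master node:

    $ ls ~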

  10. Check the HDFS version in the ephemeral instance:

    $ ephemeral-hdfs/bin/hadoop version
    Hadoop 2.0.0-cdh4.2.0
    
  11. Check the HDFS version in the persistent instance with the following command:

    $ persistent-hdfs/bin/hadoop version
    Hadoop 2.0.0-cdh4.2.0
    
  12. To change the logging level, move to the Spark configuration directory:

    $ cd spark/conf
    
  13. The default log level, INFO, is too verbose, so let's change it to ERROR:

    1. Create the log4j.properties file by renaming the template:

      $ mv log4j.properties.template log4j.properties
      
    2. Open log4j.properties in vi or your favorite editor:

      $ vi log4j.properties
      
    3. Change the second line from log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console.
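
      If you prefer to make the same change non-interactively, a one-line sketch using GNU sed:

      $ sed -i 's/log4j.rootCategory=INFO, console/log4j.rootCategory=ERROR, console/' log4j.properties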

  14. Copy the configuration to all slave nodes after the change:

    $ spark-ec2/copy-dir spark/conf
    

    You should get something like this:

  15. Destroy the Spark cluster:

    $ spark-ec2 destroy spark-cluster
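
    If you only want to pause the cluster rather than destroy it, the script also supports stop and start actions (a sketch; start needs the key pair and identity file again to reconfigure the nodes):

    $ spark-ec2 stop spark-cluster
    $ spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem start spark-cluster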