
Machine Learning with Spark - Second Edition

By: Rajdeep Dua, Manpreet Singh Ghotra

Overview of this book

This book will teach you about popular machine learning algorithms and their implementation. You will learn how various machine learning concepts are implemented in the context of Spark ML. You will start by installing Spark on single-node and multinode clusters. Next, you'll see how to execute Scala- and Python-based programs for Spark ML. Then, we will take a few datasets and go deeper into clustering, classification, and regression. Toward the end, we will also cover text processing using Spark ML. Once you have learned the concepts, you can apply them in green-field implementations or use them to migrate existing systems to this new platform, for example from Mahout or scikit-learn to Spark ML. By the end of this book, you will have the skills to leverage Spark's features to create your own scalable machine learning applications and power a modern data-driven business.
Table of Contents (13 chapters)

Getting Spark running on Amazon EC2

The Spark project provides scripts to run a Spark cluster in the cloud on Amazon's EC2 service. These scripts are located in the ec2 directory. You can run the spark-ec2 script contained in this directory with the following command:

$ ./ec2/spark-ec2

Running the script this way, without any arguments, shows the help output:

Usage: spark-ec2 [options] <action> <cluster_name>
<action> can be: launch, destroy, login, stop, start, get-master

Options:
...

Before creating a Spark EC2 cluster, you will need to ensure that you have an
Amazon account.

If you don't have an Amazon Web Services account, you can sign up at http://aws.amazon.com/.
The AWS console is available at http://aws.amazon.com/console/.

You will also need to create an Amazon EC2 key pair and retrieve the relevant security credentials. The Spark documentation for EC2 (available at http://spark.apache.org/docs/latest/ec2-scripts.html) explains the requirements:

Create an Amazon EC2 key pair for yourself. This can be done by logging into your Amazon Web Services account through the AWS console, clicking on Key Pairs on the left sidebar, and creating and downloading a key. Make sure that you set the permissions for the private key file to 600 (that is, only you can read and write it) so that ssh will work.
Whenever you want to use the spark-ec2 script, set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your Amazon EC2 access key ID and secret access key, respectively. These can be obtained from the AWS homepage by clicking Account | Security Credentials | Access Credentials.

When creating a key pair, choose a name that is easy to remember. We will simply use the name spark for the key pair. The key pair file itself will be called spark.pem. As mentioned earlier, ensure that the key pair file permissions are set appropriately and that the environment variables for the AWS credentials are exported using the following commands:

$ chmod 600 spark.pem
$ export AWS_ACCESS_KEY_ID="..."
$ export AWS_SECRET_ACCESS_KEY="..."

Keep your downloaded key pair file safe and don't lose it, as it can only be downloaded once, at the time it is created!
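A quick way to catch these setup mistakes before launching is to check the key file mode and the credential variables up front. The following Python sketch does exactly that; the `check_ec2_prereqs` helper is our own convenience function, not part of the spark-ec2 tooling:

```python
import os
import stat

def check_ec2_prereqs(key_file):
    """Return a list of problems that would make a spark-ec2 launch fail early."""
    problems = []
    # The private key must exist and be readable only by its owner (mode 600),
    # otherwise ssh refuses to use it.
    if not os.path.isfile(key_file):
        problems.append("key file %s not found" % key_file)
    elif stat.S_IMODE(os.stat(key_file).st_mode) & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append("key file %s is group/world accessible; run chmod 600" % key_file)
    # spark-ec2 reads the AWS credentials from these environment variables.
    for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        if not os.environ.get(var):
            problems.append("environment variable %s is not set" % var)
    return problems

if __name__ == "__main__":
    for problem in check_ec2_prereqs("spark.pem"):
        print(problem)
```

If the script prints nothing, the permissions and credentials are in place and you can proceed to the launch step.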

Note that launching an Amazon EC2 cluster in the following section will incur costs to your AWS account.

Launching an EC2 Spark cluster

We're now ready to launch a small Spark cluster by changing into the ec2 directory and then running the cluster launch command:

$ cd ec2
$ ./spark-ec2 --key-pair=spark --identity-file=spark.pem \
  --region=us-east-1 --zone=us-east-1a launch my-spark-cluster

This will launch a new Spark cluster called my-spark-cluster with one master and one slave node of instance type m3.medium. The cluster will be launched with a Spark version built for Hadoop 2. The key pair name we used is spark, and the key pair file is spark.pem (if you gave your files different names or have an existing AWS key pair, use those names instead).
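If you plan to launch clusters from your own scripts, it can help to assemble the spark-ec2 argument list programmatically rather than pasting shell lines. A minimal Python sketch, using only the flags shown in the command above (the `build_launch_command` helper is our own name, not part of the spark-ec2 tooling):

```python
def build_launch_command(key_pair, identity_file, region, zone, cluster_name):
    """Assemble the argument list for a spark-ec2 launch invocation."""
    return [
        "./spark-ec2",
        "--key-pair=%s" % key_pair,
        "--identity-file=%s" % identity_file,
        "--region=%s" % region,
        "--zone=%s" % zone,
        "launch",
        cluster_name,
    ]

# Example: the same launch as above, ready to hand to subprocess.call
# from inside the ec2 directory.
print(" ".join(build_launch_command(
    "spark", "spark.pem", "us-east-1", "us-east-1a", "my-spark-cluster")))
```

Passing the resulting list to `subprocess.call` from the ec2 directory performs the actual launch.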

It might take quite a while for the cluster to fully launch and initialize. You should see something like the following immediately after running the launch command:

Setting up security groups...
Creating security group my-spark-cluster-master
Creating security group my-spark-cluster-slaves
Searching for existing cluster my-spark-cluster in region
us-east-1...

Spark AMI: ami-5bb18832
Launching instances...
Launched 1 slave in us-east-1a, regid = r-5a893af2
Launched master in us-east-1a, regid = r-39883b91
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state...........
Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-110-128.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-110-128.compute-
1.amazonaws.com port 22: Connection refused

Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-110-128.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-110-128.compute-
1.amazonaws.com port 22: Connection refused

Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-110-128.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-110-128.compute-
1.amazonaws.com port 22: Connection refused

Cluster is now in 'ssh-ready' state. Waited 510 seconds.
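The long wait for the 'ssh-ready' state is essentially the script retrying an SSH connection until port 22 on each instance starts accepting connections; the "Connection refused" warnings above are those early attempts failing while the instances boot. A rough Python sketch of that kind of polling loop (an illustration only, not the script's actual implementation):

```python
import socket
import time

def wait_for_port(host, port=22, timeout=600, interval=10):
    """Poll until a TCP connection to host:port succeeds; False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # A successful connect means sshd on the instance is answering.
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(interval)  # instance still booting; try again shortly
    return False
```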

If the cluster has launched successfully, you should eventually see console output similar to the following listing (some optional setup steps, such as Tachyon, RStudio, and Ganglia, may report errors without preventing the Spark cluster itself from starting):

./tachyon/setup.sh: line 5: /root/tachyon/bin/tachyon: 
No such file or directory

./tachyon/setup.sh: line 9: /root/tachyon/bin/tachyon-start.sh:
No such file or directory

[timing] tachyon setup: 00h 00m 01s
Setting up rstudio
spark-ec2/setup.sh: line 110: ./rstudio/setup.sh:
No such file or directory

[timing] rstudio setup: 00h 00m 00s
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...
ec2-52-91-214-206.compute-1.amazonaws.com
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-52-91-214-206.compute-1.amazonaws.com closed.
Shutting down GANGLIA gmetad: [FAILED]
Starting GANGLIA gmetad: [ OK ]
Stopping httpd: [FAILED]
Starting httpd: httpd: Syntax error on line 154 of /etc/httpd
/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so
into server: /etc/httpd/modules/mod_authz_core.so: cannot open
shared object file: No such file or directory
[FAILED]
[timing] ganglia setup: 00h 00m 03s
Connection to ec2-52-90-110-128.compute-1.amazonaws.com closed.
Spark standalone cluster started at
http://ec2-52-90-110-128.compute-1.amazonaws.com:8080

Ganglia started at http://ec2-52-90-110-128.compute-
1.amazonaws.com:5080/ganglia

Done!
ubuntu@ubuntu:~/work/spark-1.6.0-bin-hadoop2.6/ec2$

This will create two VMs, the Spark master and the Spark slave, as shown in the following screenshot:

To test whether we can connect to our new cluster, we can run the following command:

$ ssh -i spark.pem root@ec2-52-90-110-128.compute-1.amazonaws.com

Remember to replace the public domain name of the master node (the address after root@ in the preceding command) with the correct Amazon EC2 public domain name that will be shown in your console output after launching the cluster.

You can also retrieve your cluster's master public domain name by running the following command:

$ ./spark-ec2 -i spark.pem get-master my-spark-cluster

After successfully running the ssh command, you will be connected to your Spark master node in EC2, and your terminal output should match the following screenshot:

We can test whether our cluster is correctly set up with Spark by changing into the Spark directory and running an example in the local mode:

$ cd spark
$ MASTER=local[2] ./bin/run-example SparkPi

You should see output similar to what you would get by running the same command on your local computer:

...
14/01/30 20:20:21 INFO SparkContext: Job finished: reduce at
SparkPi.scala:35, took 0.864044012 s

Pi is roughly 3.14032
...
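SparkPi estimates pi by Monte Carlo sampling: it draws random points in the unit square and multiplies the fraction that fall inside the quarter circle by 4. A plain-Python sketch of the same computation, without Spark's parallelism (the `estimate_pi` function and its seed are our own choices):

```python
import random

def estimate_pi(num_samples, seed=42):
    """Monte Carlo estimate of pi: 4 times the fraction of random points in
    the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print("Pi is roughly", estimate_pi(100000))
```

In the Spark version, the sampling is distributed across partitions with a map and the hit counts are combined with a reduce, which is why the job log above reports "Job finished: reduce at SparkPi.scala".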

Now that we have an actual cluster with multiple nodes, we can test Spark in the cluster mode. We can run the same example on the cluster, using our one slave node by passing in the master URL instead of the local version:

$ MASTER=spark://ec2-52-90-110-128.compute-1.amazonaws.com:7077 \
  ./bin/run-example SparkPi

Note that you will need to substitute the preceding master domain name with the correct domain name for your specific cluster.

Again, the output should be similar to running the example locally; however, the log messages will show that your driver program has connected to the Spark master:

...
14/01/30 20:26:17 INFO client.Client$ClientActor: Connecting to
master spark://ec2-54-220-189-136.eu-
west-1.compute.amazonaws.com:7077

14/01/30 20:26:17 INFO cluster.SparkDeploySchedulerBackend:
Connected to Spark cluster with app ID app-20140130202617-0001

14/01/30 20:26:17 INFO client.Client$ClientActor: Executor added:
app-20140130202617-0001/0 on worker-20140130201049-
ip-10-34-137-45.eu-west-1.compute.internal-57119
(ip-10-34-137-45.eu-west-1.compute.internal:57119) with 1 cores

14/01/30 20:26:17 INFO cluster.SparkDeploySchedulerBackend:
Granted executor ID app-20140130202617-0001/0 on hostPort
ip-10-34-137-45.eu-west-1.compute.internal:57119 with 1 cores,
2.4 GB RAM

14/01/30 20:26:17 INFO client.Client$ClientActor:
Executor updated: app-20140130202617-0001/0 is now RUNNING

14/01/30 20:26:18 INFO spark.SparkContext: Starting job: reduce at
SparkPi.scala:39

...

Feel free to experiment with your cluster. Try out the interactive console in Scala, for example:

$ ./bin/spark-shell --master spark://ec2-52-90-110-128.compute-1.amazonaws.com:7077

Once you've finished, type exit to leave the console. You can also try the PySpark console by running the following command:

$ ./bin/pyspark --master spark://ec2-52-90-110-128.compute-1.amazonaws.com:7077

You can use the Spark Master web interface to see the applications registered with the master. To load the Master web UI, navigate to http://ec2-52-90-110-128.compute-1.amazonaws.com:8080 (again, remember to replace this domain name with your own master domain name).

Remember that you will be charged by Amazon for usage of the cluster. Don't forget to stop or terminate this test cluster once you're done with it. To do this, you can first exit the ssh session by typing exit to return to your own local system and then run the following command:

$ ./ec2/spark-ec2 -k spark -i spark.pem destroy my-spark-cluster

You should see the following output:

Are you sure you want to destroy the cluster my-spark-cluster?
Searching for existing cluster my-spark-cluster...
Found 1 master(s), 1 slaves
The following instances will be terminated:
> ec2-54-227-127-14.compute-1.amazonaws.com
> ec2-54-91-61-225.compute-1.amazonaws.com
ALL DATA ON ALL NODES WILL BE LOST!!
Destroy cluster my-spark-cluster (y/N): y
Terminating master...
Terminating slaves...

Type y and press Enter to confirm and destroy the cluster.

Congratulations! You've just set up a Spark cluster in the cloud, run a fully parallel example program on this cluster, and terminated it. If you would like to try out any of the example code in the subsequent chapters (or your own Spark programs) on a cluster, feel free to experiment with the Spark EC2 scripts and launch a cluster of your chosen size and instance profile. (Just be mindful of the costs and remember to shut it down when you're done!)