Machine Learning with Spark - Second Edition

By: Rajdeep Dua, Manpreet Singh Ghotra

Overview of this book

This book will teach you about popular machine learning algorithms and their implementation. You will learn how various machine learning concepts are implemented in the context of Spark ML. You will start by installing Spark on single-node and multi-node clusters. Next, you'll see how to execute Scala- and Python-based programs for Spark ML. Then we will take a few datasets and go deeper into clustering, classification, and regression. Toward the end, we will also cover text processing using Spark ML. Once you have learned the concepts, you can apply them in green-field implementations or use them to migrate existing systems, such as those built on Mahout or scikit-learn, to this new platform. By the end of this book, you will have acquired the skills to leverage Spark's features to create your own scalable machine learning applications and power a modern data-driven business.

Machine learning algorithms supported by Spark

The following algorithms are supported by Spark ML:

  • Collaborative filtering
    • Alternating Least Squares (ALS): Collaborative filtering is often used for recommender systems. These techniques aim to fill the missing entries of a user-item association matrix. The spark.mllib currently supports model-based collaborative filtering. In this implementation, users and products are described by a small set of latent factors that can be used to predict missing entries. The spark.mllib uses the ALS algorithm to learn these latent factors.
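
As a rough, minimal sketch of how this looks in practice (assuming an existing SparkContext sc and the RDD-based spark.mllib API; the ratings and parameter values are made up for illustration), ALS can be trained and then used to fill in a missing entry of the user-item matrix:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Ratings are (user, product, rating) triples, that is, the known entries of the user-item matrix
val ratings = sc.parallelize(Seq(
  Rating(1, 10, 4.0), Rating(1, 20, 1.0),
  Rating(2, 10, 5.0), Rating(2, 30, 3.0)
))

// Arguments: ratings, rank (number of latent factors), iterations, lambda (regularization)
val alsModel = ALS.train(ratings, 2, 10, 0.01)

// Predict a missing entry of the user-item matrix from the learned latent factors
println(alsModel.predict(1, 30))
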
  • Clustering: This is an unsupervised learning problem where the aim is to group subsets of entities with one another based on the notion of similarity. Clustering is used for exploratory analysis and as a component of a hierarchical supervised learning pipeline. When used in a learning pipeline, distinct classifiers or regression models are trained for each cluster. The following clustering techniques are implemented in Spark:
    • k-means: This is one of the commonly used clustering algorithms that cluster the data points into a predefined number of clusters. It is up to the user to choose the number of clusters. The spark.mllib implementation includes a parallelized variant of the k-means++ method (http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
    • Gaussian mixture: A Gaussian Mixture Model (GMM) represents a composite distribution where points are taken from one of the k Gaussian sub-distributions. Each of these distributions has its own probability. The spark.mllib implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples.
    • Power Iteration Clustering (PIC): This is a scalable algorithm for clustering vertices of a graph given pairwise similarities as edge properties. It computes a pseudo-eigenvector of the normalized affinity matrix of the graph using power iteration.

Power iteration is an eigenvalue algorithm. Given a matrix X, the algorithm will produce a number λ (the eigenvalue) and a non-zero vector v (the eigenvector), such that Xv = λv.

Pseudo-eigenvectors of a matrix can be thought of as the eigenvectors of a nearby matrix. More specifically, pseudo-eigenvectors are defined as follows:

Let A be an n x n matrix and let E be any matrix such that ||E|| = ε. Then the eigenvectors of A + E are defined to be pseudo-eigenvectors of A. PIC uses this pseudo-eigenvector to cluster the vertices of the graph.

The spark.mllib includes an implementation of PIC using GraphX. It takes an RDD of (srcId, dstId, similarity) tuples and outputs a model with the clustering assignments. The similarities must be non-negative. PIC makes the assumption that the similarity measure is symmetric.

(In statistics, a similarity measure or similarity function is a real-valued function that quantifies the similarity between two objects. Such measures are, in some sense, the inverse of distance metrics; an example is cosine similarity.)

A pair (srcId, dstId), regardless of ordering, should appear at most once in the input data.
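
A minimal sketch of the spark.mllib PIC API, assuming an existing SparkContext sc; the similarity values below are made up for illustration:

import org.apache.spark.mllib.clustering.PowerIterationClustering

// Pairwise similarities as (srcId, dstId, similarity) tuples; similarities must be non-negative
val similarities = sc.parallelize(Seq(
  (0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.1), (3L, 4L, 0.9)
))

val picModel = new PowerIterationClustering()
  .setK(2)               // number of clusters
  .setMaxIterations(20)  // number of power iteration steps
  .run(similarities)

picModel.assignments.collect().foreach { a =>
  println(s"vertex ${a.id} -> cluster ${a.cluster}")
}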

    • Latent Dirichlet Allocation (LDA): This is a form of topic model that infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm in the following sense:

Topics correspond to cluster centers, and documents correspond to examples in a dataset.
Topics and documents both exist in a feature space, where feature vectors are vectors of word counts (also known as bag of words).
Instead of estimating a clustering using a traditional distance-based approach, LDA uses a function based on a statistical model of how text documents are generated.
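
The following sketch shows the RDD-based LDA API, assuming an existing SparkContext sc; the tiny word-count vectors are placeholders for a real bag-of-words corpus:

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Each document is a (documentId, wordCountVector) pair over a fixed vocabulary
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0, 5.0)),
  (1L, Vectors.dense(0.0, 3.0, 4.0, 1.0))
))

val ldaModel = new LDA().setK(2).setMaxIterations(50).run(corpus)

// topicsMatrix is vocabSize x k: per-topic word weights, the "cluster centers" of LDA
println(ldaModel.topicsMatrix)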

    • Bisecting k-means: This is a type of hierarchical clustering. Hierarchical Cluster Analysis (HCA) is a method of cluster analysis that builds a hierarchy of clusters top-down. In this approach, all observations start in one cluster and splits are performed recursively as one moves down the hierarchy.

Hierarchical clustering is one of the commonly used methods of cluster analysis that seek to build a hierarchy of clusters.
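
A minimal bisecting k-means sketch against the RDD-based API, assuming an existing SparkContext sc and toy two-dimensional points:

import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.1, 0.1), Vectors.dense(0.3, 0.3),
  Vectors.dense(10.1, 10.1), Vectors.dense(10.3, 10.3)
))

// Start from a single cluster and recursively bisect until k leaf clusters remain
val bkmModel = new BisectingKMeans().setK(2).run(points)

println(s"Compute cost: ${bkmModel.computeCost(points)}")
bkmModel.clusterCenters.zipWithIndex.foreach { case (center, i) =>
  println(s"Cluster $i center: $center")
}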

    • Streaming k-means: When data arrives in a stream, we want to estimate clusters dynamically and update them as new data arrives. The spark.mllib supports streaming k-means clustering, with parameters to control the decay of the estimates. The algorithm uses a generalization of the mini-batch k-means update rule.
  • Classification
    • Decision Trees: Decision trees and their ensembles are popular methods for classification and regression. Decision trees are popular as they are easy to interpret, handle categorical features, and extend to the multiclass classification setting. They do not require feature scaling and are also able to capture non-linearities and feature interactions. Tree ensemble algorithms, such as random forests and boosting, are among the top performers for classification and regression scenarios.

The spark.mllib implements decision trees for binary and multiclass classification and regression. It supports both continuous and categorical features. The implementation partitions data by rows, which allows distributed training with millions of instances.
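
A minimal decision tree sketch with the RDD-based API; it assumes an existing SparkContext sc and a LIBSVM-formatted file at a placeholder path:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // placeholder path
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()  // empty map: treat all features as continuous
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val treeModel = DecisionTree.trainClassifier(training, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

val testError = test.filter(p => treeModel.predict(p.features) != p.label).count().toDouble / test.count()
println(s"Test error: $testError")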

    • Naive Bayes: Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem (https://en.wikipedia.org/wiki/Bayes%27_theorem) with strong (naive) independence assumptions between the features.

Naive Bayes is a multiclass classification algorithm with the assumption of independence between every pair of features. In a single pass of training data, the algorithm computes the conditional probability distribution of each feature given the label, and then it applies Bayes' theorem to compute the conditional probability distribution of a label given an observation, which is then used for prediction. The spark.mllib supports multinomial naive Bayes and Bernoulli Naive Bayes. These models are generally used for document classification.
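
A minimal naive Bayes sketch, assuming an existing SparkContext sc; the labelled word-count vectors stand in for real document features:

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Word-count style features with a class label per document
val docs = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0))
))

// lambda is the additive smoothing parameter; modelType is "multinomial" or "bernoulli"
val nbModel = NaiveBayes.train(docs, lambda = 1.0, modelType = "multinomial")
println(nbModel.predict(Vectors.dense(0.0, 1.0, 1.0)))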

    • Probabilistic Classifier: In machine learning, a probabilistic classifier is a classifier that can predict, given an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the sample should belong to. Probabilistic classifiers provide classification with some degree of certainty, which can be useful on its own or when combining classifiers into ensembles.
    • Logistic Regression: This is a method used to predict a binary response. Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative distribution function of the logistic distribution.

It is a special case of Generalized Linear Models (GLM) that predicts the probability of the outcome. For more background and implementation details, refer to the documentation on logistic regression in spark.mllib.

GLM is considered a generalization of linear regression that allows for response variables that have an error distribution other than a normal distribution.
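
A minimal logistic regression sketch with the RDD-based API, assuming an existing SparkContext sc and a LIBSVM file at a placeholder path:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // placeholder path
val Array(training, test) = data.randomSplit(Array(0.8, 0.2))

// Train a binary logistic regression model using the L-BFGS optimizer
val lrModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

val accuracy = test.filter(p => lrModel.predict(p.features) == p.label).count().toDouble / test.count()
println(s"Accuracy: $accuracy")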

    • Random Forest: This algorithm uses an ensemble of decision trees to decide decision boundaries. Random forests combine many decision trees, which reduces the risk of overfitting.

Spark ML supports random forests for binary and multiclass classification as well as regression. They can be used with continuous as well as categorical features.
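
A minimal random forest sketch in the same style, assuming an existing SparkContext sc and a LIBSVM file at a placeholder path:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // placeholder path
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 20
val featureSubsetStrategy = "auto"  // let Spark choose the number of features per split
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val rfModel = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

val accuracy = test.filter(p => rfModel.predict(p.features) == p.label).count().toDouble / test.count()
println(s"Accuracy: $accuracy")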

  • Dimensionality reduction: This is the process of reducing the number of variables on which machine learning will be done. It can be used to extract latent features from raw features or to compress data while maintaining the overall structure. MLlib provides support for dimensionality reduction on top of the RowMatrix class.
    • Singular value decomposition (SVD): The singular value decomposition of an m x n matrix M (real or complex) is a factorization of the form UΣV*, where U is an m x r matrix with orthonormal columns, Σ is an r x r diagonal matrix with non-negative real numbers on the diagonal, and V is an n x r matrix with orthonormal columns. r is equal to the rank of the matrix M.
    • Principal component analysis (PCA): This is a statistical method used to find a rotation such that the first coordinate has the largest possible variance, and each succeeding coordinate, in turn, has the largest variance possible. The columns of the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.

MLlib supports PCA for tall-and-skinny matrices stored in row-oriented format using RowMatrix.
Spark supports feature extraction and transformation using TF-IDF, ChiSqSelector, Normalizer, and Word2Vec.
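
A minimal sketch of SVD and PCA on a RowMatrix, assuming an existing SparkContext sc and a toy matrix:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
))
val mat = new RowMatrix(rows)

// Truncated SVD keeping the top 2 singular values: M is approximately U * Sigma * V*
val svd = mat.computeSVD(2, computeU = true)
println(svd.s)  // singular values (the diagonal of Sigma)

// Top 2 principal components; projecting the rows onto them reduces the dimensionality
val pc = mat.computePrincipalComponents(2)
val projected = mat.multiply(pc)
projected.rows.collect().foreach(println)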

  • Frequent pattern mining:
    • FP-growth: FP stands for frequent pattern. The algorithm first counts item occurrences (attribute-value pairs) in the dataset and stores them in a header table.

In the second pass, the algorithm builds the FP-tree structure by inserting instances (made of items). Items in each instance are sorted in descending order of their frequency in the dataset; this ensures that the tree can be processed quickly. Items in each instance that do not meet the minimum coverage threshold are discarded. For use cases where many instances share the most frequent items, the FP-tree provides high compression close to the tree root.
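
A minimal FP-growth sketch, assuming an existing SparkContext sc; the transactions are made up for illustration:

import org.apache.spark.mllib.fpm.FPGrowth

// Each transaction is an array of items
val transactions = sc.parallelize(Seq(
  Array("bread", "milk"),
  Array("bread", "butter", "milk"),
  Array("beer", "bread")
))

val fpModel = new FPGrowth()
  .setMinSupport(0.5)     // keep itemsets that appear in at least 50% of the transactions
  .setNumPartitions(4)
  .run(transactions)

fpModel.freqItemsets.collect().foreach { itemset =>
  println(s"${itemset.items.mkString("[", ",", "]")} -> ${itemset.freq}")
}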

    • Association rules: Association rule learning is a mechanism for discovering interesting relations between variables in large databases.

The spark.mllib implements a parallel rule-generation algorithm for constructing rules that have a single item as the consequent.
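
A minimal association rules sketch, assuming an existing SparkContext sc; the frequent itemsets below would normally come from FP-growth:

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

// Frequent itemsets together with the number of transactions they appear in
val freqItemsets = sc.parallelize(Seq(
  new FreqItemset(Array("a"), 15L),
  new FreqItemset(Array("b"), 35L),
  new FreqItemset(Array("a", "b"), 12L)
))

val rules = new AssociationRules().setMinConfidence(0.8).run(freqItemsets)

// Each generated rule has a single item as its consequent
rules.collect().foreach { rule =>
  println(s"${rule.antecedent.mkString(",")} => ${rule.consequent.mkString(",")} : ${rule.confidence}")
}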

    • PrefixSpan: This is a sequential pattern mining algorithm.
  • Evaluation metrics: The spark.mllib comes with a suite of metrics for evaluating the algorithms.
  • PMML model export: The Predictive Model Markup Language (PMML) is an XML-based predictive model interchange format. PMML provides a mechanism for analytic applications to describe and exchange predictive models produced by machine learning algorithms.

The spark.mllib allows the export of its machine learning models to their equivalent PMML models.
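
A minimal PMML export sketch using a k-means model (one of the models that supports PMML export), assuming an existing SparkContext sc; the output path is hypothetical:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 9.0)))

// Arguments: data, k (number of clusters), maxIterations
val kmeansModel = KMeans.train(points, 2, 10)

// Export the model as a PMML XML string, or save it to a path
println(kmeansModel.toPMML())
kmeansModel.toPMML(sc, "/tmp/kmeans-pmml")  // hypothetical output directory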

  • Optimization (Developer)
    • Stochastic Gradient Descent (SGD): This is a gradient descent method used to minimize an objective function that is written as a sum of differentiable functions.

Gradient descent methods and the Stochastic Subgradient Descent (SGD) are included as a low-level primitive in MLlib, on top of which various ML algorithms are developed.
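
As a minimal sketch of an algorithm built on this primitive (assuming an existing SparkContext sc; the data and parameter values are made up), linear regression with SGD can be run as follows:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0)),
  LabeledPoint(2.0, Vectors.dense(2.0)),
  LabeledPoint(3.0, Vectors.dense(3.0))
))

val numIterations = 100  // each iteration takes a gradient step on a sampled mini-batch
val stepSize = 0.1
val sgdModel = LinearRegressionWithSGD.train(training, numIterations, stepSize)

println(sgdModel.weights)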

    • Limited-Memory BFGS (L-BFGS): This is an optimization algorithm from the family of quasi-Newton methods that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm using a limited amount of computer memory. It is used for parameter estimation in machine learning.

The BFGS method approximates Newton's method, a class of hill-climbing optimization techniques that seeks a stationary point of a function. For such problems, a necessary condition for optimality is that the gradient be zero.
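
A minimal sketch of the low-level L-BFGS primitive applied to logistic loss, assuming an existing SparkContext sc and a LIBSVM file at a placeholder path; the optimizer settings are made up for illustration:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // placeholder path
val numFeatures = data.take(1)(0).features.size

// The low-level optimizer works on (label, featureVector) pairs; append a bias term
val training = data.map(p => (p.label, MLUtils.appendBias(p.features)))

val (weightsWithIntercept, lossHistory) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  10,      // number of corrections kept in memory
  1e-4,    // convergence tolerance
  20,      // maximum number of iterations
  0.1,     // regularization parameter
  Vectors.dense(new Array[Double](numFeatures + 1)))  // initial weights (zeros)

println(s"Final loss: ${lossHistory.last}")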