Machine Learning with Spark - Second Edition

By: Rajdeep Dua, Manpreet Singh Ghotra

Overview of this book

This book will teach you about popular machine learning algorithms and their implementation. You will learn how various machine learning concepts are implemented in the context of Spark ML. You will start by installing Spark on single-node and multi-node clusters. Next, you'll see how to execute Scala- and Python-based programs for Spark ML. Then we will take a few datasets and go deeper into clustering, classification, and regression. Toward the end, we will also cover text processing using Spark ML. Once you have learned the concepts, you can apply them in green-field implementations or use them to migrate existing systems, such as those built on Mahout or scikit-learn, to this new platform. By the end of this book, you will have acquired the skills to leverage Spark's features to create your own scalable machine learning applications and power a modern data-driven business.

Machine learning algorithms supported by Spark

The following algorithms are supported by Spark ML:

  • Collaborative filtering
    • Alternating Least Squares (ALS): Collaborative filtering is often used for recommender systems. These techniques aim to fill the missing entries of a user-item association matrix. The spark.mllib currently supports model-based collaborative filtering. In this implementation, users and products are described by a small set of latent factors that can be used to predict missing entries. The spark.mllib uses the ALS algorithm to learn these latent factors.
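
As a rough, minimal sketch of how this looks in practice (assuming an existing SparkContext sc and the RDD-based spark.mllib API; the ratings and parameter values are made up for illustration), ALS can be trained and then used to fill in a missing entry of the user-item matrix:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Ratings are (user, product, rating) triples, that is, the known entries of the user-item matrix
val ratings = sc.parallelize(Seq(
  Rating(1, 10, 4.0), Rating(1, 20, 1.0),
  Rating(2, 10, 5.0), Rating(2, 30, 3.0)
))

// Arguments: ratings, rank (number of latent factors), iterations, lambda (regularization)
val alsModel = ALS.train(ratings, 2, 10, 0.01)

// Predict a missing entry of the user-item matrix from the learned latent factors
println(alsModel.predict(1, 30))
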
  • Clustering: This is an unsupervised learning problem where the aim is to group subsets of entities with one another based on the notion of similarity. Clustering is used for exploratory analysis and as a component of a hierarchical supervised learning pipeline. When used in a learning pipeline, distinct classifiers or regression models are trained for each cluster. The following clustering techniques are implemented in Spark:
    • k-means: This is one of the commonly used clustering algorithms that cluster the data points into a predefined number of clusters. It is up to the user to choose the number of clusters. The spark.mllib implementation includes a parallelized variant of the k-means++ method (http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
    • Gaussian mixture: A Gaussian Mixture Model (GMM) represents a composite distribution where points are taken from one of the k Gaussian sub-distributions. Each of these distributions has its own probability. The spark.mllib implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples.
    • Power Iteration Clustering (PIC): This is a scalable algorithm for clustering vertices of a graph given pairwise similarities as edge properties. It computes a pseudo-eigenvector of the normalized affinity matrix of the graph using power iteration.

Power iteration is an eigenvalue algorithm. Given a matrix X, the algorithm will produce a number λ (the eigenvalue) and a non-zero vector v (the eigenvector), such that Xv = λv.

Pseudo-eigenvectors of a matrix can be thought of as the eigenvectors of a nearby matrix. More specifically, pseudo-eigenvectors are defined as follows:

Let A be an n x n matrix and let E be any matrix such that ||E|| = ε. Then the eigenvectors of A + E are defined to be pseudo-eigenvectors of A. PIC uses this pseudo-eigenvector to cluster the vertices of the graph.

The spark.mllib includes an implementation of PIC using GraphX. It takes an RDD of (srcId, dstId, similarity) tuples and outputs a model with the clustering assignments. The similarities must be non-negative. PIC makes the assumption that the similarity measure is symmetric.

(In statistics, a similarity measure or similarity function is a real-valued function that quantifies the similarity between two objects. Such measures are, in some sense, the inverse of distance metrics; an example is cosine similarity.)

A pair (srcId, dstId), regardless of ordering, should appear at most once in the input data.
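
A minimal sketch of the spark.mllib PIC API, assuming an existing SparkContext sc; the similarity values below are made up for illustration:

import org.apache.spark.mllib.clustering.PowerIterationClustering

// Pairwise similarities as (srcId, dstId, similarity) tuples; similarities must be non-negative
val similarities = sc.parallelize(Seq(
  (0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.1), (3L, 4L, 0.9)
))

val picModel = new PowerIterationClustering()
  .setK(2)               // number of clusters
  .setMaxIterations(20)  // number of power iteration steps
  .run(similarities)

picModel.assignments.collect().foreach { a =>
  println(s"vertex ${a.id} -> cluster ${a.cluster}")
}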

    • Latent Dirichlet Allocation (LDA): This is a form of topic model that infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm in the following sense:

Topics correspond to cluster centers, and documents correspond to examples in a dataset.
Topics and documents both exist in a feature space, where feature vectors are vectors of word counts (also known as bag of words).
Instead of estimating a clustering using a traditional distance-based approach, LDA uses a function based on a statistical model of how text documents are generated.
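
The following sketch shows the RDD-based LDA API, assuming an existing SparkContext sc; the tiny word-count vectors are placeholders for a real bag-of-words corpus:

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Each document is a (documentId, wordCountVector) pair over a fixed vocabulary
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0, 5.0)),
  (1L, Vectors.dense(0.0, 3.0, 4.0, 1.0))
))

val ldaModel = new LDA().setK(2).setMaxIterations(50).run(corpus)

// topicsMatrix is vocabSize x k: per-topic word weights, the "cluster centers" of LDA
println(ldaModel.topicsMatrix)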

    • Bisecting k-means: This is a type of hierarchical clustering. Hierarchical Cluster Analysis (HCA) is a method of cluster analysis that builds a hierarchy of clusters top-down. In this approach, all observations start in one cluster and splits are performed recursively as one moves down the hierarchy.

Hierarchical clustering is one of the commonly used methods of cluster analysis that seek to build a hierarchy of clusters.
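
A minimal bisecting k-means sketch against the RDD-based API, assuming an existing SparkContext sc and toy two-dimensional points:

import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.1, 0.1), Vectors.dense(0.3, 0.3),
  Vectors.dense(10.1, 10.1), Vectors.dense(10.3, 10.3)
))

// Start from a single cluster and recursively bisect until k leaf clusters remain
val bkmModel = new BisectingKMeans().setK(2).run(points)

println(s"Compute cost: ${bkmModel.computeCost(points)}")
bkmModel.clusterCenters.zipWithIndex.foreach { case (center, i) =>
  println(s"Cluster $i center: $center")
}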

    • Streaming k-means: When data arrives in a stream, we want to estimate clusters dynamically and update them as new data arrives. The spark.mllib supports streaming k-means clustering, with parameters to control the decay of the estimates. The algorithm uses a generalization of the mini-batch k-means update rule.
  • Classification
    • Decision Trees: Decision trees and their ensembles are popular methods for classification and regression. Decision trees are popular as they are easy to interpret, handle categorical features, and extend to the multiclass classification setting. They do not require feature scaling and are also able to capture non-linearities and feature interactions. Tree ensemble algorithms, such as random forests and boosting, are among the top performers for classification and regression scenarios.

The spark.mllib implements decision trees for binary and multiclass classification and regression. It supports both continuous and categorical features. The implementation partitions data by rows, which allows distributed training with millions of instances.
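
A minimal decision tree sketch with the RDD-based API; it assumes an existing SparkContext sc and a LIBSVM-formatted file at a placeholder path:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // placeholder path
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()  // empty map: treat all features as continuous
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val treeModel = DecisionTree.trainClassifier(training, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

val testError = test.filter(p => treeModel.predict(p.features) != p.label).count().toDouble / test.count()
println(s"Test error: $testError")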

    • Naive Bayes: Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem (https://en.wikipedia.org/wiki/Bayes%27_theorem) with strong (naive) independence assumptions between the features.

Naive Bayes is a multiclass classification algorithm with the assumption of independence between every pair of features. In a single pass of training data, the algorithm computes the conditional probability distribution of each feature given the label, and then it applies Bayes' theorem to compute the conditional probability distribution of a label given an observation, which is then used for prediction. The spark.mllib supports multinomial naive Bayes and Bernoulli Naive Bayes. These models are generally used for document classification.
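
A minimal naive Bayes sketch, assuming an existing SparkContext sc; the labelled word-count vectors stand in for real document features:

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Word-count style features with a class label per document
val docs = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0))
))

// lambda is the additive smoothing parameter; modelType is "multinomial" or "bernoulli"
val nbModel = NaiveBayes.train(docs, lambda = 1.0, modelType = "multinomial")
println(nbModel.predict(Vectors.dense(0.0, 1.0, 1.0)))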

    • Probabilistic Classifier: In machine learning, a probabilistic classifier is a classifier that can predict, given an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the sample should belong to. Probabilistic classifiers provide classification with some degree of certainty, which can be useful on its own or when combining classifiers into ensembles.
    • Logistic Regression: This is a method used to predict a binary response. Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative distribution function of the logistic distribution.

It is a special case of Generalized Linear Models (GLM) that predicts the probability of the outcome. For more background and implementation details, refer to the documentation on logistic regression in spark.mllib.

GLM is considered a generalization of linear regression that allows for response variables that have an error distribution other than a normal distribution.
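
A minimal logistic regression sketch with the RDD-based API, assuming an existing SparkContext sc and a LIBSVM file at a placeholder path:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // placeholder path
val Array(training, test) = data.randomSplit(Array(0.8, 0.2))

// Train a binary logistic regression model using the L-BFGS optimizer
val lrModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

val accuracy = test.filter(p => lrModel.predict(p.features) == p.label).count().toDouble / test.count()
println(s"Accuracy: $accuracy")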

    • Random Forest: This algorithm uses an ensemble of decision trees to decide decision boundaries. Random forests combine many decision trees, which reduces the risk of overfitting.

Spark ML supports random forests for binary and multiclass classification as well as regression. They can be used with continuous as well as categorical features.
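
A minimal random forest sketch in the same style, assuming an existing SparkContext sc and a LIBSVM file at a placeholder path:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // placeholder path
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 20
val featureSubsetStrategy = "auto"  // let Spark choose the number of features per split
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val rfModel = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

val accuracy = test.filter(p => rfModel.predict(p.features) == p.label).count().toDouble / test.count()
println(s"Accuracy: $accuracy")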

  • Dimensionality reduction: This is the process of reducing the number of variables on which machine learning will be done. It can be used to extract latent features from raw features or to compress data while maintaining the overall structure. MLlib provides support for dimensionality reduction on top of the RowMatrix class.
    • Singular value decomposition (SVD): The singular value decomposition of an m x n matrix M (real or complex) is a factorization of the form UΣV*, where U is an m x r matrix with orthonormal columns, Σ is an r x r diagonal matrix with non-negative real numbers on the diagonal, and V is an n x r matrix with orthonormal columns. r is equal to the rank of the matrix M.
    • Principal component analysis (PCA): This is a statistical method used to find a rotation such that the first coordinate has the largest possible variance, and each succeeding coordinate, in turn, has the largest variance possible. The columns of the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.

MLlib supports PCA for tall-and-skinny matrices stored in row-oriented format using RowMatrix.
Spark supports feature extraction and transformation using TF-IDF, ChiSqSelector, Normalizer, and Word2Vec.
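
A minimal sketch of SVD and PCA on a RowMatrix, assuming an existing SparkContext sc and a toy matrix:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
))
val mat = new RowMatrix(rows)

// Truncated SVD keeping the top 2 singular values: M is approximately U * Sigma * V*
val svd = mat.computeSVD(2, computeU = true)
println(svd.s)  // singular values (the diagonal of Sigma)

// Top 2 principal components; projecting the rows onto them reduces the dimensionality
val pc = mat.computePrincipalComponents(2)
val projected = mat.multiply(pc)
projected.rows.collect().foreach(println)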

  • Frequent pattern mining:
    • FP-growth: FP stands for frequent pattern. The algorithm first counts item occurrences (attribute-value pairs) in the dataset and stores them in a header table.

In the second pass, the algorithm builds the FP-tree structure by inserting instances (made of items). Items in each instance are sorted in descending order of their frequency in the dataset; this ensures that the tree can be processed quickly. Items in each instance that do not meet the minimum coverage threshold are discarded. For use cases where many instances share the most frequent items, the FP-tree provides high compression close to the tree root.
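
A minimal FP-growth sketch, assuming an existing SparkContext sc; the transactions are made up for illustration:

import org.apache.spark.mllib.fpm.FPGrowth

// Each transaction is an array of items
val transactions = sc.parallelize(Seq(
  Array("bread", "milk"),
  Array("bread", "butter", "milk"),
  Array("beer", "bread")
))

val fpModel = new FPGrowth()
  .setMinSupport(0.5)     // keep itemsets that appear in at least 50% of the transactions
  .setNumPartitions(4)
  .run(transactions)

fpModel.freqItemsets.collect().foreach { itemset =>
  println(s"${itemset.items.mkString("[", ",", "]")} -> ${itemset.freq}")
}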

    • Association rules: Association rule learning is a mechanism for discovering interesting relations between variables in large databases.

The spark.mllib implements a parallel rule-generation algorithm for constructing rules that have a single item as the consequent.
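
A minimal association rules sketch, assuming an existing SparkContext sc; the frequent itemsets below would normally come from FP-growth:

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

// Frequent itemsets together with the number of transactions they appear in
val freqItemsets = sc.parallelize(Seq(
  new FreqItemset(Array("a"), 15L),
  new FreqItemset(Array("b"), 35L),
  new FreqItemset(Array("a", "b"), 12L)
))

val rules = new AssociationRules().setMinConfidence(0.8).run(freqItemsets)

// Each generated rule has a single item as its consequent
rules.collect().foreach { rule =>
  println(s"${rule.antecedent.mkString(",")} => ${rule.consequent.mkString(",")} : ${rule.confidence}")
}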

    • PrefixSpan: This is a sequential pattern mining algorithm.
  • Evaluation metrics: The spark.mllib comes with a suite of metrics for evaluating the algorithms.
  • PMML model export: The Predictive Model Markup Language (PMML) is an XML-based predictive model interchange format. PMML provides a mechanism for analytic applications to describe and exchange predictive models produced by machine learning algorithms.

The spark.mllib allows the export of its machine learning models to their equivalent PMML models.
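
A minimal PMML export sketch using a k-means model (one of the models that supports PMML export), assuming an existing SparkContext sc; the output path is hypothetical:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 9.0)))

// Arguments: data, k (number of clusters), maxIterations
val kmeansModel = KMeans.train(points, 2, 10)

// Export the model as a PMML XML string, or save it to a path
println(kmeansModel.toPMML())
kmeansModel.toPMML(sc, "/tmp/kmeans-pmml")  // hypothetical output directory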

  • Optimization (Developer)
    • Stochastic Gradient Descent (SGD): This is a gradient descent method used to minimize an objective function that is written as a sum of differentiable functions.

Gradient descent methods and the Stochastic Subgradient Descent (SGD) are included as a low-level primitive in MLlib, on top of which various ML algorithms are developed.
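
As a minimal sketch of an algorithm built on this primitive (assuming an existing SparkContext sc; the data and parameter values are made up), linear regression with SGD can be run as follows:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0)),
  LabeledPoint(2.0, Vectors.dense(2.0)),
  LabeledPoint(3.0, Vectors.dense(3.0))
))

val numIterations = 100  // each iteration takes a gradient step on a sampled mini-batch
val stepSize = 0.1
val sgdModel = LinearRegressionWithSGD.train(training, numIterations, stepSize)

println(sgdModel.weights)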

    • Limited-Memory BFGS (L-BFGS): This is an optimization algorithm from the family of quasi-Newton methods that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm using a limited amount of computer memory. It is used for parameter estimation in machine learning.

The BFGS method approximates Newton's method, a class of hill-climbing optimization techniques that seeks a stationary point of a function. For such problems, a necessary condition for optimality is that the gradient be zero.
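
A minimal sketch of the low-level L-BFGS primitive applied to logistic loss, assuming an existing SparkContext sc and a LIBSVM file at a placeholder path; the optimizer settings are made up for illustration:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // placeholder path
val numFeatures = data.take(1)(0).features.size

// The low-level optimizer works on (label, featureVector) pairs; append a bias term
val training = data.map(p => (p.label, MLUtils.appendBias(p.features)))

val (weightsWithIntercept, lossHistory) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  10,      // number of corrections kept in memory
  1e-4,    // convergence tolerance
  20,      // maximum number of iterations
  0.1,     // regularization parameter
  Vectors.dense(new Array[Double](numFeatures + 1)))  // initial weights (zeros)

println(s"Final loss: ${lossHistory.last}")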