Scala for Machine Learning, Second Edition

Scala for Machine Learning, Second Edition - Second Edition

Overview of this book

The discovery of information through data clustering and classification is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, engineering design, logistics, manufacturing, and trading strategies, to detection of genetic anomalies. The book is your one stop guide that introduces you to the functional capabilities of the Scala programming language that are critical to the creation of machine learning algorithms such as dependency injection and implicits. You start by learning data preprocessing and filtering techniques. Following this, you'll move on to unsupervised learning techniques such as clustering and dimension reduction, followed by probabilistic graphical models such as Naïve Bayes, hidden Markov models and Monte Carlo inference. Further, it covers the discriminative algorithms such as linear, logistic regression with regularization, kernelization, support vector machines, neural networks, and deep learning. You’ll move on to evolutionary computing, multibandit algorithms, and reinforcement learning. Finally, the book includes a comprehensive overview of parallel computing in Scala and Akka followed by a description of Apache Spark and its ML library. With updated codes based on the latest version of Scala and comprehensive examples, this book will ensure that you have more than just a solid fundamental knowledge in machine learning with Scala.

Scala for Machine Learning Second Edition

Credits

About the Author

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

Free Chapter

Getting Started

Mathematical notations for the curious

Why machine learning?

Why Scala?

Model categorization

Taxonomy of machine learning algorithms

Leveraging Java libraries

Tools and frameworks

Source code

Let's kick the tires

Summary

Data Pipelines

Modeling

Defining a methodology

Monadic data transformation

Workflow computational model

Profiling data

Assessing a model

Summary

Data Preprocessing

Time series in Scala

Moving averages

Fourier analysis

The discrete Kalman filter

Alternative preprocessing techniques

Summary

Unsupervised Learning

K-mean clustering

Expectation-Maximization (EM)

Summary

Dimension Reduction

Challenging model complexity

The divergences

Principal components analysis (PCA)

Nonlinear models

Summary

Naïve Bayes Classifiers

Probabilistic graphical models

Naïve Bayes classifiers

Multivariate Bernoulli classification

Naïve Bayes and text mining

Pros and cons

Summary

Sequential Data Models

Markov decision processes

The hidden Markov model (HMM)

Conditional random fields

Regularized CRF and text analytics

Comparing CRF and HMM

Performance consideration

Summary

Monte Carlo Inference

The purpose of sampling

Gaussian sampling

Monte Carlo approximation

Bootstrapping with replacement

Markov Chain Monte Carlo (MCMC)

Summary

Regression and Regularization

Linear regression

Regularization

Numerical optimization

Logistic regression

Summary

Multilayer Perceptron

Feed-forward neural networks (FFNN)

The multilayer perceptron (MLP)

Evaluation

Benefits and limitations

Summary

Deep Learning

Sparse autoencoder

Restricted Boltzmann Machines (RBMs)

Convolution neural networks

Kernel Models and SVM

Kernel functions

The support vector machine (SVM)

Performance considerations

Summary

Evolutionary Computing

Evolution

Genetic algorithms and machine learning

Genetic algorithm components

Implementation

GA for trading strategies

Advantages and risks of genetic algorithms

Summary

Multiarmed Bandits

K-armed bandit

Thompson sampling

Upper bound confidence

Summary

Reinforcement Learning

Reinforcement learning

Learning classifier systems

Summary

Parallelism in Scala and Akka

Overview

Scala

Scalability with Actors

Akka

Summary

Apache Spark MLlib

Overview

Apache Spark core

MLlib library

Reusable ML pipelines

Extending Spark

Streaming engine

Performance evaluation

Pros and cons

Summary

Basic Concepts

Scala programming

Mathematics

Finances 101

Suggested online courses

Index

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Mathematics

This section describes very briefly some of the mathematical concepts used in the book.

Linear algebra

Many algorithms used in machine learning such as minimization of a convex loss function, principal component analysis, or least squares regression involves invariably manipulation and transformation of matrices. There are many good books on the subject, from the inexpensive [A:2] to the sophisticated [A:3].

QR decomposition

The QR decomposition (also known as QR factorization) is the decomposition of a matrix A into a product of an orthogonal matrix Q and upper triangular matrix R. A=QR and Q^T Q=I [A:4].

The decomposition is unique if A is a real, square, and invertible matrix. In the case of a rectangle matrix A, m by n with m > n the decomposition is implemented as the dot product of two vector of matrices: A = [Q₁ , Q₂ ].[R₁ , R₂ ]^T where Q₁ is an m by n matrix, Q₂ is an m by n matrix, R₁ is n by n and an upper triangle matrix, R₂ is an m by n null matrix.

QR decomposition is a reliable method for solving large systems of linear equations for which the number of equations (rows) exceeds the number of variables (columns). Its asymptotic computational time complexity for a training set of m dimension and n observations is O(mn² -n³ /3).

It is used to minimize the loss function for ordinary least squares regression, which is covered in the Ordinary least squares regression section in Chapter 9, Regression and Regularization.

LU factorization

LU factorization is a technique to solve a matrix equation A.x = b, where A is a nonsingular matrix and x and b are two vectors. The technique consists of decomposing the original matrix A as the product of simple matrix A= A₁ A₂ …A_n.

Basic LU factorization defines A as the product of a lower unit triangular matrix L and an upper triangular matrix, U: A=LU
LU factorization with pivot define A as the product of a permutation matrix, P, a unit lower triangular matrix L and an upper triangular matrix, U: A=PLU

LDL decomposition

LDL decomposition for real matrices defines a real positive matrix A as the product of a lower unit triangular matrix L, a diagonal matrix D, and the transposed matrix of L, L^T A=LDL^T.

Cholesky factorization

The Cholesky factorization (also known as Cholesky decomposition) of real matrices is a special case of LU factorization [A:4]. It decomposes a positive-definite matrix A into a product of lower triangle matrix L and its conjugate transpose L^T, A=LL^T.

The asymptotic computational time complexity for the Cholesky factorization is O(mn2), where m is the number of features (model parameters) and n the number of observations. The Cholesky factorization is used in linear least squares, Kalman filter (in the Kalman filter: Recursive algorithm section in Chapter 3, Data Preprocessing), and nonlinear Quasi-Newton optimizer.

Singular Value Decomposition (SVD)

The SVD of real matrices defines a m x n real matrix A as the product of a m square real unitary matrix, U a m x n rectangular diagonal matrix S and the transpose V^T matrix of a real matrix, A=USV^T.

The columns of the matrix U and V are the orthogonal bases and the value of the diagonal matrix S is the singular values [A:4]. The asymptotic computational time complexity for the singular value decomposition for n observations and m features is O(mn² -n³ ). The singular value decomposition is used in minimizing the total least squares and solving homogeneous linear equations.

Eigenvalue decomposition

The Eigen-decomposition of a real square matrix A is the canonical factorization as Ax = ?x.

? is the eigenvalue (scalar) corresponding to vector x. The n by n matrix A is then defined as A = QDQ^T . Q is the square matrix that contains the eigenvectors and D is the diagonal matrix wherein the elements are the eigenvalues associated to the eigenvectors [A:5], [A:6]. The Eigen-decomposition is used in Principal Components Analysis (in the Principal Components Analysis section in Chapter 5, Dimension Reduction.)

Algebraic and numerical libraries

There are many more Open Source algebraic libraries available to developers as API besides the Apache Commons Math used in Chapter 3, Data Pre-processing, Chapter 5, Dimensional Reduction, and Chapter 6, Naive Bayes Classifiers, and Apache Spark/MLlib in Chapter 17, Apache Spark MLlib.

jBlas 1.2.3 (Java) is created by Mikio Braun under BSD revised license. The library provides Java and Scala developers with a high-level Java interface to BLAS and LAPACK—https://github.com/mikiobraun/jblas
Colt 1.2.0 (Java) is a high performance scientific library developed at the CERN under the European Organization for Nuclear Research licence—http://acs.lbl.gov/ACSSoftware/colt/
AlgeBird 2.10 (Scala) is developed at Twitter under Apache Public License 2.0. It defines abstract linear algebra concepts using monoid and monads. The library is an excellent example of high-level functional programming using Scala—https://github.com/twitter/algebird
Breeze 0.8 (Scala) is a numerical processing library using Apache Public License 2.0 originally created by David Hall. It is a component of the ScalaNLP suite of machine learning and numerical computing libraries—http://www.scalanlp.org/

The Apache Spark/MLlib framework bundles jBlas, Colt, and Breeze. The Iitb framework for conditional random fields uses colt linear algebra components.

Note

Alternative to Java/Scala libraries

If your application or project needs a high performance numerical processing tool under limited resources (CPU and RAM memory), using a C/C++ compiled library is an excellent alternative if portability is not a constraint. The binary functions are accessed through the Java Native Interface (JNI).

First order predicate logic

Propositional logic is the formulation of axioms or proposition. There are several formal representations of propositions:

Noun-VERB-Adjective: Variance of the stock price EXCEEDS 0.76 or Minimization of the loss function DOES NOT converge
Entity-value= Boolean: Variance of the stock price GREATER+THAN 0.76 = true or Minimization of the loss function converge = false
Variable op value: Variance of the stock price > 0.76 or Minimization of loss function != converge

Propositional logic is subject to the rule of Boolean calculus. Let's consider three propositions P, Q, and R and the three Boolean operators NOT, AND, and OR:

NOT (NOT P) = P
P AND false = false, P AND true = P, P OR false = P, P OR true = P
P AND Q = Q AND P, P OR Q = Q OR P
P AND (Q AND R) = (P AND Q) AND R

First order predicate logic also known as first order predicate calculus is the quantification of a propositional logic [A:7]. The most common formulations of the first order logic are:

Rules IF P THEN action
Existential operators

First order logic is used to describe the classifiers in the learning classifier systems (in the Learning classifiers systems section in Chapter 13, Evolutionary Computation).

Jacobian and Hessian matrices

Let's consider a function with n variables x_i and m output y_j f: { x_i } -> {y_j =f_j (x)}. The Jacobian matrix [A:8] is the matrix of the first order partial derivatives of an output value of a continuous, differential function:

The Hessian matrix is the square matrix of the second order of partial derivatives of a continuously, twice differentiable function:

Consider the following example:

Summary of optimization techniques

Optimization is critical to the efficiency and to some extent extends the accuracy of the machine learning algorithms. Some basic knowledge in this field goes a long way to build practical solutions for large datasets.

Gradient descent methods

There are many optimization techniques that either rely on exclusively on the first order derivative or provide a linear algebra alternative to the computation of the gradient.

Steepest descent

The steepest descent (also known as gradient descent) method is one of the simplest techniques for finding a local minimum of any continuous, differentiable function F or the global minimum for any define, differentiable, convex function [A:9]. The value of a vector or data point x_t+1 at iteration t+1 is computed from the previous value x_t using the gradient F of function F and the slope ?:

The steepest gradient algorithm is used for solving a system of nonlinear equations, minimization of the loss function in the logistic regression (refer to the Numerical optimization section in Chapter 9, Regression and Regularization), support vector classifier (refer to the Non-separable case section in Chapter 12, Kernel Models and Support Vector Machine), and multilayer perceptron (refer to the Training strategy and classification section in Chapter 10, Multilayer Perceptron).

Conjugate gradient

The conjugate gradient solves unconstrained optimization problems and systems of linear equations. It is an alternative to the LU factorization for positive, definite symmetric square matrices. The solution x* of the equation A.x = b is expanded as the weighted summation of n basis orthogonal directions p_i (also known as conjugate directions):

The solution x* is extracted by computing the i^th conjugate vector p_i and then computing the coefficients a_i.

Stochastic gradient descent

The stochastic gradient method is a variant of the steepest descent that minimizes the convex function by defining the objective function F as the sum of differentiable, basis function f_i as follows:

The solution x_t+1 at iteration t+1 is computed from the value x_t at iteration t, the step size (or learning rate) a and the sum of the gradient of the basic functions [A:10]. The stochastic gradient descent is usually faster than other gradient descents or quasi-new methods in converging toward a solution for a convex function. The stochastic gradient descent is used in the logistic regression, support vector machines, and back-propagation neural networks.

Stochastic gradient is particularly suitable for discriminative models with large datasets [A:11]. Spark/MLlib makes extensive use of the stochastic gradient method.

Note

Batch gradient descent

The batch gradient descent is introduced and implemented in the Step 5: Implementing the classifier section under Let's kick the tires in Chapter 1, Getting Started.

Quasi-Newton algorithms

Quasi-Newton algorithms are variations on Newton's methods of finding the value of a vector or data point that maximizes or minimizes a function F (1st order derivative is null) [A:12].

Newton's method is a well-known and simple optimization method for finding the solution of equations F(x) = 0 for which F is continuous and 2nd order differentiable. It relies on the Taylor series expansion to approximate the function F with a quadratic approximation on variable ?x = x_t+1 - x_t to compute the value at the next iteration using the first order F' and second order F" derivatives:

Contrary to the Newton method, Quasi-Newton methods do not require that the 2nd order derivative, Hessian matrix, of the objective function be computed. It just has to be approximated [A:13]. There are several approaches to approximate the computation of the Hessian matrix.

BFGS

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) is a Quasi-Newton iterative numerical method to solve unconstrained nonlinear problems. The Hessian matrix H_t+1 at iteration t is approximated using the value of the previous iteration t as H_t+1 =H_t + U_t + V_t applied to the Newton equation for the direction p_t:

The BFGS is used in the minimization of the cost function for the conditional random field and L₁ and L₂ regression.

L-BFGS

The performance of the BFGS algorithm is related to the caching of the approximation of the Hessian matrix in memory (U, V) at the cost of high memory consumption.

The Limited memory BFGS algorithm or L-BFGS is a variant of BFGS that uses a minimum amount of computer RAM. The algorithm maintains the last m incremental updates of the values ?x_t and gradient ?G_t at iteration t, then computes these values for the next step t+1:

It is supported in Apache Commons Math 3.3+, Apache Spark/MLlib 1.0+, Colt 1.0+, and Iiitb CRF libraries. L-BFGS is used in the minimization of the loss function in the Conditional Random Fields (refer to the Conditional random fields section in Chapter 7, Sequential Data Models).

Nonlinear least squares minimization

Let's consider the classic minimization of the least squares of a nonlinear function y = F(x, w) with w_i parameters for observations {y, x_i }. The objective is to minimize the sum of the squares of residuals r_i:

Gauss-Newton

The Gauss-Newton technique is a generalization of Newton's method. The technique solves nonlinear least squares by updating the parameters w_t+1 at iteration t+1 using the first order derivative (also known as Jacobian):

The Gauss-Newton algorithm is used in Logistic Regression (refer to the Logistic regression section in Chapter 9, Regression and Regularization).

Levenberg-Marquardt

The Levenberg-Marquardt algorithm is an alternative to the Gauss-Newton for solving nonlinear least squares and curve fitting problems. The method consists of adding the gradient (Jacobian) terms to the residuals r_i to approximate the least squares error:

The algorithm is used in the training of the logistic regression (refer to the Logistic regression section in Chapter 9, Regression and Regularization).

Lagrange multipliers

The Lagrange multipliers methodology is an optimization technique to find local optima of a multivariate function subject to equality constraints [A:14]. The problem is stated as follows:

maximize f(x) subject to g(x) = c, c is a constant, x is a variable or features vector.

The methodology introduces a new variable ? to integrate the constraint g into a function, known as the Lagrange function ℒ(x, ?). Let's note the ℒ the gradient of ℒ over the variables x_i and ?. The Lagrange multipliers are computing by maximizing ℒ:

Consider the following example:

Lagrange multipliers are used in minimizing the loss function in the nonseparable case of linear support vector machines (refer to the Non-separable case section of Chapter 12, Kernel Models and Support Vector Machines).

Overview dynamic programming

The purpose of dynamic programming is to break down an optimization problem into a sequence of steps known as substructures [A:15]. There are two types of problems for which dynamic programming is suitable.

The solution of a global optimization problem can be broken down into optimal solutions for its subproblems. The solution of the subproblems is known as optimal substructures. Greedy algorithms or the computation of the minimum span of a graph are examples of the decomposition into optimal substructures. Such algorithms can be implemented either recursively or iteratively.

The solution of the global problem is applied recursively to the subproblems, where the number of subproblems is small. This approach is known as dynamic programming using overlapping substructures. Forward-backward passes on hidden Markov models, the Viterbi algorithm (refer to the Viterbi algorithm section in Chapter 12, Sequential Data Models) or the back-propagation of error in a multilayer perceptron (refer to the Step 3 - Error backpropagation section in Chapter 10, Multilayer Perceptron) are good examples of overlapping substructures.

The mathematical formulation of dynamic programming solutions is specific to the problem it attempts to resolve. Dynamic programming techniques are also commonly used in mathematical puzzles such as the tour of Hanoi.

Scala for Machine Learning, Second Edition - Second Edition

Scala for Machine Learning, Second Edition - Second Edition

Overview of this book

Related Content you might be interested in

Current Title:

Scala for Machine Learning, Second Edition - Second Edition

Scala Machine Learning Projects

A Handbook of Mathematical Models with Python

Mastering Predictive Analytics with R

Mathematics

Linear algebra

QR decomposition

LU factorization

LDL decomposition

Cholesky factorization

Singular Value Decomposition (SVD)

Eigenvalue decomposition

Algebraic and numerical libraries

Note

First order predicate logic

Jacobian and Hessian matrices

Summary of optimization techniques

Gradient descent methods

Steepest descent

Conjugate gradient

Stochastic gradient descent

Note

Quasi-Newton algorithms

BFGS

L-BFGS

Nonlinear least squares minimization

Gauss-Newton

Levenberg-Marquardt

Lagrange multipliers

Overview dynamic programming