## Mathematics

This section very briefly describes some of the mathematical concepts used in this book.

### Linear algebra

Many algorithms used in machine learning such as minimization of a convex loss function, principal component analysis, or least squares regression invariably involves manipulation and transformation of matrices. There are many good books on the subject, from the inexpensive [A:2] to the sophisticated [A:3].

#### QR decomposition

The QR decomposition (or the QR factorization) is the decomposition of a matrix *A* into a product of an orthogonal matrix *Q* and upper triangular matrix *R*. So, *A=QR* and *Q ^{T}*

*Q=I*[A:4].

The decomposition is unique if *A* is a real, square, and invertible matrix. In the case of a rectangle matrix *A*, *m by n* with *m > n*, the decomposition is implemented as the dot product of two vector of matrices: *A = [Q _{1}, Q_{2}].[R_{1}, R_{2}]^{T}*, where

*Q*is an

_{1}*m by n*matrix,

*Q*is an

_{2}*m by n*matrix,

*R*is an

_{1}*n by n*upper triangle matrix, and

*R*is an

_{2}*m by n*null matrix.

The QR decomposition is a reliable method used to solve a large system of linear equations for which the number of equations (rows) exceeds the number of variables (columns). Its asymptotic computational time complexity for a training set of *m* dimensions and *n* observations is *O(mn ^{2}-n^{3}/3)*.

It is used to minimize the loss function for ordinary least squares regression (refer to the *Ordinary least squares regression* section in Chapter 6, *Regression and Regularization*).

#### LU factorization

**LU factorization**
is a technique used to solve a matrix equation *A.x = b*, where *A* is a nonsingular matrix and *x* and *b* are two vectors. The technique consists of decomposing the original matrix *A* as the product of a simple matrix *A= A _{1}A_{2}…A_{n}*.

**Basic LU factorization**: This defines*A*as the product of a lower unit triangular matrix*L*and an upper triangular matrix*U*. So,*A=LU*.**LU factorization with a pivot**: This defines*A*as the product of a permutation matrix*P*, a lower unit triangular matrix*L*, and an upper triangular matrix*U*. So,*A=PLU*.

#### LDL decomposition

**The **
**LDL decomposition**
for real matrices defines a real positive matrix *A* as the product of a lower unit triangular matrix *L*, a diagonal matrix *D*, and the transposed matrix of *L*, that is, *L ^{T}*. So,

*A=LDL*.

^{T}#### Cholesky factorization

The **Cholesky**
** factorization** (or the **Cholesky decomposition**) of real matrices is a special case of the LU factorization [A:4]. It decomposes a positive definite matrix *A* into a product of a lower triangular matrix *L* and its conjugate transpose *L ^{T}*. So,

*A=LL*.

^{T}The asymptotic computational time complexity for the Cholesky factorization is *O(mn ^{2})*, where

*m*is the number of features (model parameters) and

*n*is the number of observations. The Cholesky factorization is used in linear least squares Kalman filter (refer to the

*The recursive algorithm*section in Chapter 3,

*Data Preprocessing*) and nonlinear Quasi-Newton optimizer.

#### Singular Value Decomposition

The **singular value decomposition** (**SVD**) of real matrices defines an *m by n* real matrix *A* as the product of an *m* real square unitary matrix *U*, an *m by n* rectangular diagonal matrix *Σ*, and the transpose matrix *V ^{T}* of a real matrix. So,

*A=UΣV*.

^{T}The columns of the *U* and *V* matrices are the orthogonal bases and the value of the diagonal matrix *Σ* is a singular value [A:4]. The asymptotic computational time complexity for the singular value decomposition for *n* observations and *m* features is *O(mn ^{2}-n^{3})*. The singular value decomposition is used to minimize the total least squares and solve homogeneous linear equations.

#### Eigenvalue decomposition

The Eigen decomposition of a real square matrix *A* is the canonical factorization, *A*
*x = λx*.

*λ* is the **eigenvalue** (scalar) corresponding to the vector *x*. The *n by n* matrix *A* is then defined as *A = QDQ ^{T}*.

*Q*is the square matrix that contains the eigenvectors and

*D*is the diagonal matrix whose elements are the eigenvalues associated with the eigenvectors [A:5] and [A:6]. The Eigen decomposition is used in Principal Components Analysis (refer to the

*Principal components analysis*section in Chapter 4,

*Unsupervised Learning*).

#### Algebraic and numerical libraries

There are many more open source algebraic libraries available to developers as APIs besides Apache Commons Math, which is used in Chapter 3, *Data Preprocessing*, Chapter 5, *Naïve Bayes Classifiers*, Chapter 6, *Regression and Regularization*, and Apache Spark/MLlib in Chapter 12, *Scalable Frameworks*. They are as follows:

**jBlas 1.2.3**(Java) created by Mikio Braun under the BSD revised license. This library provides Java and Scala developers a high-level Java interface to**BLAS**and**LAPACK**(https://github.com/mikiobraun/jblas).**Colt 1.2.0**(Java) is a high-performance scientific library developed at CERN under the European Organization for Nuclear Research license (http://acs.lbl.gov/ACSSoftware/colt/).**AlgeBird 2.10**(Scala) developed at Twitter under Apache Public License 2.0. It defines concepts of abstract linear algebra using monoid and monads. This library is an excellent example of high-level functional programming using Scala (https://github.com/twitter/algebird).**Breeze 0.8**(Scala) is a numerical processing library using Apache Public License 2.0 originally created by David Hall. It is a component of the ScalaNLP suite of machine learning and numerical computing libraries (http://www.scalanlp.org/).

The Apache Spark/MLlib framework bundles jBlas, Colt, and Breeze. The Iitb framework for conditional random fields uses Colt linear algebra components.

### Note

**An alternative to Java/Scala libraries**

If your application or project needs a high-performance numerical processing tool under limited resources (CPU and RAM memory), then using a C/C++ compiled library is an excellent alternative if portability is not a constraint. The binary functions are accessed through the Java Native Interface (JNI).

### First order predicate logic

**Propositional logic** is the formulation of **axioms** or propositions. There are several formal representations of propositions:

**Noun-VERB-Adjective**: For example,*Variance of the stock price EXCEEDS 0.76*or*Minimization of the loss function DOES NOT converge***Entity-value = Boolean**: For example,*Variance of the stock price GREATER+THAN 0.76 = true*or*Minimization of the loss function converge = false***Variable op value**: For example,*Variance_stock_price > 0.76*or*Minimization_loss_function != converge*

Propositional logic is subject to the rules of Boolean calculus. Let's consider three propositions: *P*, *Q*, and *R* and the three Boolean operators *NOT*, *AND*, and *OR*:

*NOT (NOT P) = P**P AND false = false*,*P AND true = P*,*P or false = P*, and*P or true = P**P AND Q = Q AND P*and*P OR Q = Q OR P**P AND (Q AND R) = (P AND Q) AND R*

**First order predicate logic,** also known as **first order predicate calculus,** is the quantification of a propositional logic [A:7]. The most common formulations of the first order logic are as follows:

Rules (for example,

*IF P THEN action*)Existential operators

First order logic is used to describe the classifiers in the learning classifier systems (refer to the *XCS rules* section in Chapter 11, *Reinforcement Learning*).

### Jacobian and Hessian matrices

Let's consider a function with *n* variables *x _{i}* and

*m*outputs

*y*such that

_{j}*f: {x*.

_{i}} -> {y_{j}=f_{j}(x)}The **Jacobian matrix** [A:8] is the matrix of the first order partial derivatives of the output values of a continuous, differential function:

The **Hessian matrix** is the square matrix of the second order partial derivatives of a continuously, twice differentiable function:

An example is as follows:

### Summary of optimization techniques

The same comments regarding linear algebra algorithms apply to optimization. Treating such techniques in depth would have rendered the book impractical. However, optimization is critical to the efficiency and, to a lesser extent, the accuracy of the machine learning algorithms. Some basic knowledge in this field goes a long way to build practical solutions for large datasets.

#### Gradient descent methods

##### Steepest descent

The **steepest descent** (or gradient descent) method is one of the simplest techniques used to find a local minimum of any continuous, differentiable function *F* or the global minimum for any defined, differentiable, and convex function [A:9]. The value of a vector or data point *x _{t+1}* at iteration

*t+1*is computed from the previous value

*x*using the

_{t}*gradient*

*∇*

*F*of function

*F*and the slope

*γ*:

The steepest gradient algorithm is used to solve systems of nonlinear equations and minimization of the loss function in the logistic regression (refer to the *Numerical optimization* section in Chapter 6, *Regression and Regularization*), in support vector classifiers (refer to the *The nonseparable case – the soft margin* section in Chapter 8, *Kernel Models and Support Vector Machines*), and in multilayer perceptrons (refer to the *Training and classification* section in Chapter 9, *Artificial Neural Networks*).

##### Conjugate gradient

The **conjugate gradient** solves unconstrained optimization problems and systems of linear equations. It is an alternative to the LU factorization for positive, definite, and symmetric square matrices. The solution *x** of the equation *Ax = b* is expanded as the weighted summation of *n* basis orthogonal directions *p _{i}* (or

**conjugate directions**):

The solution *x** is extracted by computing the *i ^{th}* conjugate vector

*p*and then computing the coefficients

_{i}*α*.

_{i}##### Stochastic gradient descent

The **stochastic gradient** method is a variant of the steepest descent that minimizes the convex function by defining the objective function *F* as the sum of differentiable, basis function *f _{i}*:

The solution *x _{t+1}* at iteration

*t+1*is computed from the value

*x*at iteration

_{t}*t*, the step size (or the learning rate)

*α*, and the sum of the gradient of the basis functions [A:10]. The stochastic gradient descent is usually faster than other gradient descents or quasi-Newton methods in converging toward a solution for convex functions. The stochastic gradient descent is used in logistic regression, support vector machines, and backpropagation neural networks.

Stochastic gradient is particularly suitable for discriminative models with large datasets [A:11]. Spark/MLlib makes extensive use of the stochastic gradient method.

### Note

**The batch gradient descent**

The batch gradient descent is introduced and implemented in the *Step 5 – implementing the classifier* section under *Let's kick the tires* in Chapter 1, *Getting Started*.

#### Quasi-Newton algorithms

**Quasi-Newton** algorithms are variations of Newton's method of finding the value of a vector or data point that maximizes or minimizes a function *F* (first order derivative is null) [A:12].

The Newton's method is a well-known and simple optimization method used to find the solution of equations *F(x) = 0* for which *F* is continuous and second order differentiable. It relies on the Taylor series expansion to approximate the function *F* with a quadratic approximation of variable *∆x = x _{t+1}-x_{t}* to compute the value at the next iteration using the first order

*F'*and second order

*F"*derivatives:

Contrary to Newton's method, quasi-Newton methods do not require that the second order derivative, Hessian matrix, of the objective function be computed; it just has to be approximated [A:13]. There are several approaches to approximate the computation of the Hessian matrix.

##### BFGS

The **Broyden-Fletcher-Goldfarb-Shanno** (**BGFS**) is a quasi-Newton iterative numerical method used to solve unconstrained nonlinear problems. The hessian matrix *H _{t+1}* at iteration

*t*is approximated using the value of the previous iteration

*t*as

*H*applied to the Newton equation for the direction

_{t+1}=H_{t}+ U_{t}+ V_{t}*p*:

_{t}The BFGS is used in the minimization of the cost function for the conditional random field and L_{1} and L_{2} regressions.

##### L-BFGS

The performance of the BFGS algorithm is related to the caching of the approximation of the Hessian matrix in the memory (*U*, *V*) at the cost of high-memory consumption.

The **Limited memory Broyden-Fletcher-Goldfarb-Shanno** (**L-BFGS**) algorithm is a variant of BFGS that uses a minimum amount of computer RAM. The algorithm maintains the last *m* incremental updates of the values *∆x _{t}* and gradient

*∆G*at iteration

_{t}*t*, and then computes these values for the next step

*t+1*:

It is supported by the Apache Commons Math 3.3+, Apache Spark/MLlib 1.0+, Colt 1.0+, and Iiitb CRF libraries. L-BFGS is used in the minimization of the loss function in conditional random fields (refer to the *Conditional random fields* section in Chapter 7, *Sequential Data Models*).

#### Nonlinear least squares minimization

Let's consider the classic minimization of the least squares of a nonlinear function *y = F(x, w)* with *w _{i}* parameters for observations

*{y, x*. The objective is to minimize the sum of the squares of residuals

_{i}}*r*, which is as follows:

_{i}##### Gauss-Newton

The Gauss-Newton technique is a generalization of Newton's method. The technique solves nonlinear least squares by updating the parameters *w _{t+1}* at iteration

*t+1*using the first order derivative (or Jacobian):

The Gauss-Newton algorithm is used in logistic regression (refer to the *Logistic regression* section in Chapter 6, *Regression and Regularization*).

##### Levenberg-Marquardt

The Levenberg-Marquardt algorithm is an alternative to the Gauss-Newton technique used to solve nonlinear least squares and curve fitting problems. The method consists of adding the gradient (Jacobian) terms to the residuals *r _{i}* to approximate the least squares error:

The Levenberg-Marquardt algorithm is used in the training of logistic regression (refer to the *Logistic regression* section in Chapter 6, *Regression and Regularization*).

#### Lagrange multipliers

The **Lagrange multipliers** methodology is an optimization technique used to find the local optima of a multivariate function, subject to equality constraints [A:14]. The problem is stated as *maximize f(x) subject to g(x) = c, where c is a constant and x is a variable or features vector*.

This methodology introduces a new variable *λ* to integrate the constraint *g* into a function, known as the Lagrange function *ℒ*
*(x, λ)*. Let's note *∇ℒ*, which is the gradient of *ℒ* over the variables *x _{i}* and

*λ*. The Lagrange multipliers are computed by maximizing

*ℒ*:

An example is as follows:

Lagrange multipliers are used to minimize the loss function in the nonseparable case of linear support vector machines (refer to the *The nonseparable case – the soft margin case* section in Chapter 8, *Kernel Models and Support Vector Machines*).

### Overview of dynamic programming

The purpose of **dynamic programming** is to break down an optimization problem into a sequence of steps known as **substructures** [A:15]. There are two types of problems for which dynamic programming is suitable.

The solution of a global optimization problem can be broken down into optimal solutions for its subproblems. The solution of the subproblems is known as **optimal substructures**. Greedy algorithms or the computation of the minimum span of a graph are examples of the decomposition into optimal substructures. Such algorithms can be implemented either recursively or iteratively.

The solution of the global problem is applied recursively to the subproblems if the number of subproblems is small. This approach is known as dynamic programming using **overlapping substructures**. Forward-backward passes on hidden Markov models, the Viterbi algorithm (refer to *The Viterbi algorithm* section in Chapter 7, *Sequential Data Models*), or the backpropagation of error in a multilayer perceptron (refer to the *Step 2 – error backpropagation* section in Chapter 9, *Artificial Neural Networks*) are good examples of overlapping substructures.

The mathematical formulation of a dynamic programming solution is specific to the problem it attempts to resolve. Dynamic programming techniques are also commonly used in mathematical puzzles such as *The Tower of Hanoi*.