Scala for Machine Learning, Second Edition

Overview of this book

The discovery of information through data clustering and classification is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, engineering design, logistics, manufacturing, and trading strategies to the detection of genetic anomalies. This book is a one-stop guide to the functional capabilities of the Scala programming language, such as dependency injection and implicits, that are critical to the creation of machine learning algorithms. You start by learning data preprocessing and filtering techniques. You then move on to unsupervised learning techniques such as clustering and dimension reduction, followed by probabilistic graphical models such as Naïve Bayes, hidden Markov models, and Monte Carlo inference. The book then covers discriminative algorithms such as linear and logistic regression with regularization, kernelization, support vector machines, neural networks, and deep learning. You'll move on to evolutionary computing, multiarmed bandit algorithms, and reinforcement learning. Finally, the book includes a comprehensive overview of parallel computing in Scala and Akka, followed by a description of Apache Spark and its ML library. With updated code based on the latest version of Scala and comprehensive examples, this book will ensure that you have more than just a solid fundamental knowledge of machine learning with Scala.

Let's kick the tires

This final section introduces the key elements of the training and classification workflow. A test case using a simple logistic regression is used to illustrate each step of the computational workflow.

Writing a simple workflow

The book relies on financial data in order to experiment with different learning strategies. The objective of the exercise is to build a model that can discriminate between volatile and non-volatile trading sessions for stocks or commodities. For the first example, we have selected a simplified version of the binomial logistic regression as our classifier, as we treat stock price-volume action as a continuous or pseudo-continuous process.

Note

Introduction to logistic regression

Logistic regression is treated in depth in the Logistic regression section in Chapter 9, Regression and Regularization. The model treated in this example is the simple binomial logistic regression classifier for two-dimensional observations.

The classification of trading sessions according to their volatility and volume proceeds in the following steps:

1. Scoping the problem.
2. Loading data.
3. Preprocessing raw data.
4. Discovering patterns, whenever possible.
5. Implementing the classifier.
6. Evaluating the model.

Step 1 – scoping the problem

The objective here is to create a model for stock price using its daily trading volume and volatility. Throughout the book, we will rely on financial data to evaluate and discuss the merits of different data processing and machine learning methods. In this example, the data is extracted from Yahoo Finance using the CSV format with the following fields:

Date
Price at open
Highest price in session
Lowest price in session
Price at session close
Volume
Adjusted price at session close

The enumeration YahooFinancials extracts historical daily trading information from the Yahoo Finance site:

type Features = Array[Double]
type Weights = Array[Double]
type ObsSet = Vector[Features]
type Fields = Array[String]

object YahooFinancials extends Enumeration {
  type YahooFinancials = Value
  val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value
  def toDouble(v: Value): Fields => Double =   //1
    (s: Fields) => s(v.id).toDouble
  def toArray(vs: Array[Value]): Fields => Features = //2
    (s: Fields) => vs.map(v => s(v.id).toDouble)
}

The method toDouble converts an array of strings into a single value (line 1) and toArray converts an array of strings into an array of values (line 2). The enumeration YahooFinancials is described in detail in the Data sources section in the Appendix.
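As a minimal sketch of how these two converters behave, the following snippet applies them to a single CSV row. The row values and the wrapper object FieldsDemo are made up for illustration; only the enumeration itself is taken from the listing above.

```scala
object FieldsDemo {
  type Features = Array[Double]
  type Fields = Array[String]

  object YahooFinancials extends Enumeration {
    type YahooFinancials = Value
    val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value

    // The id of each value doubles as the column index in a CSV row
    def toDouble(v: Value): Fields => Double =
      (s: Fields) => s(v.id).toDouble
    def toArray(vs: Array[Value]): Fields => Features =
      (s: Fields) => vs.map(v => s(v.id).toDouble)
  }
  import YahooFinancials._

  // Hypothetical row: date, open, high, low, close, volume, adjusted close
  val row: Fields = "2012-01-03,18.5,18.9,18.4,18.6,40000000,17.2".split(",")

  // Single field extracted by its enumeration value
  val high: Double = toDouble(HIGH)(row)

  // Several fields extracted in the requested order
  val lowHighVolume: Features = toArray(Array(LOW, HIGH, VOLUME))(row)
}
```

Because both methods return a function of type Fields => ..., the conversion can be built once and then mapped over every row of the file.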

Let's create a simple program that loads the content of the file, executes some simple preprocessing functions, and creates a simple model. We selected the CSCO stock price between January 1, 2012 and December 1, 2013 as our data input.

Let's consider two variables, price and volume, as illustrated by the following screenshot. The top graph displays the variation of the price of Cisco stock over time and the bottom bar chart represents the daily trading volume on Cisco stock over time:

Price-volume action for Cisco stock 2012-2013

Step 2 – loading data

The second step is loading the dataset from local or remote data storage. Typically, a large dataset is loaded from a database or a distributed filesystem such as the Hadoop Distributed File System (HDFS). The load method takes an absolute path name, then extracts and transforms the input data from a file into a time series of type Vector[DblPair]:

import scala.io.Source
import scala.util.Try

def load(fileName: String): Try[Vector[DblPair]] = Try {
  val src = Source.fromFile(fileName)  //3
  val data = extract(src.getLines().map(_.split(",")).drop(1)) //4
  src.close() //5
  data
}

The data file is opened through an invocation of the static method Source.fromFile (line 3). Each line is then split into fields through a map, and the header (the first row in the file) is removed using drop (line 4). The file has to be closed to avoid leaking the file handle (line 5).

Note

Data extraction

The method invocation pipeline Source.fromFile.getLines.map returns an Iterator, which can be traversed only once.
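The single-traversal behavior can be checked with a short, self-contained sketch. It uses an in-memory string instead of a file, and the object name IteratorDemo and the sample rows are assumptions made for the illustration.

```scala
object IteratorDemo {
  // Hypothetical CSV content held in memory instead of a file
  val csv = "Date,Low,High\n2012-01-03,18.4,18.9\n2012-01-04,18.5,19.0"

  // Same shape of pipeline as in load: lines, split into fields, drop the header
  val fields: Iterator[Array[String]] =
    csv.split("\n").iterator.map(_.split(",")).drop(1)

  // Materialize once; the resulting vector can be traversed as often as needed
  val rows: Vector[Array[String]] = fields.toVector

  // The iterator itself is exhausted after the first traversal
  val leftOver: Boolean = fields.hasNext
}
```

This is why extract converts the iterator into a Vector before applying any further transformation.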

The purpose of the extract method is to generate a time series of two variables (relative stock volatility and relative stock daily trading volume):

def extract(cols: Iterator[Fields]): ObsSet = {
  val features = Array[YahooFinancials](LOW, HIGH, VOLUME) //6
  val conversion = toArray(features)  //7
  cols.map(conversion(_)).toVector   
      .map(x => Array[Double](1.0 - x(0)/x(1), x(2)))  //8
}

The extract method converts the raw textual data into a two-dimensional time series. The first step consists of selecting the three features to extract: LOW (lowest stock price in the session), HIGH (highest price in the session), and VOLUME (trading volume for the session) (line 6). This feature set is used to convert each line of fields into a corresponding set of three values (line 7). Finally, the feature set is reduced to two variables (line 8):

Relative volatility of stock price in a session, 1.0 – LOW/HIGH
Trading volume for the stock in the session, VOLUME

Note

Code readability

A long pipeline of Scala higher-order methods makes the code and its underlying logic quite difficult to read. It is recommended to take long chains of method calls, such as the following:
```
val cols =
  Source.fromFile(fileName).getLines().map(_.split(",")).toArray.drop(1)
```
Then, break them down into several steps:
```
val lines = Source.fromFile(fileName).getLines()
val fields = lines.map(_.split(",")).toArray
val cols = fields.drop(1)
```
We strongly encourage the reader to consult the excellent guide Effective Scala written by Marius Eriksen from Twitter. This is definitely a must-read for any Scala developer [1:11].

Step 3 – preprocessing data

The next step is to normalize the data over the range [0.0, 1.0] before it is used to train the binomial logistic regression. It is time to introduce an immutable and flexible normalization class.

Immutable normalization

Logistic regression relies on the sigmoid curve or logistic function described in the Logistic function section in Chapter 9, Regression and Regularization. The logistic function is used to segregate training data into classes. The output value of the logistic function ranges from 0 for x = -INFINITY to 1 for x = +INFINITY. Therefore, it makes sense to normalize the input data or observations over [0, 1].

Note

To normalize or not normalize?

The purpose of normalizing data is to impose a single range of values for all the features, so the model does not favor any particular feature. Normalization techniques include linear normalization and Z-score. Normalization is an expensive operation that is not always needed.

Normalization is a linear transformation on the raw data that can be generalized to any range [l, h].

Note

Linear normalization

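M2: [0, 1] normalization of features {x_i} with minimum value x_min and maximum value x_max:

$$\tilde{x}_i = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$

M3: [l, h] normalization of features {x_i}:

$$\tilde{x}_i = l + (h - l)\,\frac{x_i - x_{min}}{x_{max} - x_{min}}$$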

The normalization of input data in supervised learning has a specific requirement: the classification and prediction of new observations have to use the normalization parameters (min, max) extracted from the training set, so all observations share the same scaling factor.
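The following sketch illustrates the [l, h] normalization M3 and the reuse of the training parameters on a new observation. It is not the MinMax class of the book: the names LinearNorm, Scale, and fit, the clamping of out-of-range values, and the sample data are assumptions chosen to keep the example self-contained.

```scala
object LinearNorm {
  // Scaling parameters extracted once from the training data
  case class Scale(min: Double, max: Double, low: Double, high: Double) {
    private val ratio = (high - low) / (max - min)

    // Values outside [min, max] are clamped to the bounds of the target range
    def apply(x: Double): Double =
      if (x < min) low
      else if (x > max) high
      else (x - min) * ratio + low
  }

  // Computes the (min, max) parameters from the training set only
  def fit(training: Vector[Double], low: Double, high: Double): Scale = {
    require(training.nonEmpty && training.max > training.min, "degenerate range")
    Scale(training.min, training.max, low, high)
  }

  val training = Vector(2.0, 4.0, 6.0)     // hypothetical training values
  val scale = fit(training, 0.0, 1.0)
  val normalized = training.map(scale(_))  // 0.0, 0.5, 1.0
  val unseen = scale(5.0)                  // 0.75, scaled with the training min and max
}
```

Keeping the parameters in a value that outlives the training step is what guarantees that training and test observations share the same scaling factor.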

Let's define the normalization class MinMax. The class is immutable: the minimum, min, and maximum, max, values are computed within the constructor. The class takes a time series of values of the parameterized type T as an argument (line 8). The steps of the normalization process are defined as follows:

Initialize the minimum and maximum values for a given time series during instantiation (line 9).
Compute the normalization parameters (line 10) and normalize the input data (line 11).
Normalize any new data point reusing the normalization parameters (line 14):

class MinMax[T : ToDouble](val values: Vector[T]) { //8
  val zero = (Double.MaxValue, -Double.MaxValue)
  val (min, max) = values.foldLeft(zero){ case ((mn, mx), x) =>
    val z = implicitly[ToDouble[T]].apply(x)
    (if(z < mn) z else mn, if(z > mx) z else mx)  //9
  }
  case class ScaleFactors(
    low: Double, high: Double, ratio: Double
  )

  var scaleFactors: Option[ScaleFactors] = None //10
  def normalize(low: Double, high: Double): Vector[Double] //11
  def normalize(value: Double): Double
}

The class constructor computes the tuple of minimum and maximum values (min, max) using a fold (line 9). The scaling parameters scaleFactors are computed during the normalization of the time series (line 11), described as follows. The method normalize initializes the scaling factors (line 12) before normalizing the input data (line 13):

def normalize(low: Double, high: Double): Vector[Double] = 
  setScaleFactors(low, high).map( scale => { //12
    values.map(x =>{
      val z = implicitly[ToDouble[T]].apply(x)
      (z - min)*scale.ratio + scale.low //13
    }) 
  }).getOrElse(/* … */)

def setScaleFactors(l: Double, h: Double): Option[ScaleFactors] = {
  // .. error handling code
  Some(ScaleFactors(l, h, (h - l)/(max - min)))
}

Subsequent observations use the same scaling factors extracted from the input time series in normalize (line 14):

def normalize(value: Double): Double = scaleFactors.map(
  scale =>
    if(value < min) scale.low
    else if (value > max) scale.high
    else (value - min)*scale.ratio + scale.low
).getOrElse( /* … */)

The class MinMax normalizes single variable observations.

Note

Statistics class

The class that extracts the basic statistics from a dataset, Stats, introduced in the Profiling data section in Chapter 2, Data Pipelines, inherits from the class MinMax.

The test case with the binomial logistic regression uses a multiple variable normalization, implemented by the class MinMaxVector which takes observations of type Vector[Array[Double]] as input:

class MinMaxVector(series: Vector[Array[Double]]) {
  val minMaxVector: Vector[MinMax[Double]] = //15
    series.transpose.map(new MinMax[Double](_))
  def normalize(low: Double, high: Double): Vector[Array[Double]]
}

The constructor of the class MinMaxVector transposes the vector of arrays of observations in order to compute the minimum and maximum values for each dimension (line 15).
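The role of the transposition can be illustrated with a small, self-contained sketch that computes one (min, max) pair per feature and then normalizes each observation over [0, 1]. The object name PerDimension and the sample observations are hypothetical, and the sketch does not rely on the MinMax class.

```scala
object PerDimension {
  // Hypothetical observations: (relative volatility, volume)
  val obs: Vector[Array[Double]] = Vector(
    Array(0.01, 100.0),
    Array(0.03, 300.0),
    Array(0.02, 200.0)
  )

  // After transposition, each inner vector holds all the values of one feature
  val columns: Vector[Vector[Double]] = obs.map(_.toVector).transpose

  // One (min, max) pair per feature
  val ranges: Vector[(Double, Double)] = columns.map(c => (c.min, c.max))

  // Each feature is normalized over [0, 1] using its own range
  val normalized: Vector[Array[Double]] = obs.map(
    _.zip(ranges).map { case (x, (mn, mx)) => (x - mn) / (mx - mn) }
  )
}
```

Without the per-feature ranges, the volume, which is several orders of magnitude larger than the volatility, would dominate the model.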

Step 4 – discovering patterns

The price action chart has a very interesting characteristic.

Analyzing data

At a closer look, a sudden change in price and increase in volume occurs about every 3 months or so. Experienced investors will undoubtedly recognize that these price-volume patterns are related to the release of quarterly earnings of Cisco. Such a regular but unpredictable pattern can be a source of concern or opportunity if risk can be properly managed. The strong reaction of the stock price to the release of corporate earnings may scare some long-term investors while enticing day traders.

The following graph visualizes the potential correlation between sudden price change (volatility) and heavy trading volume:

Price-volume correlation for Cisco stock 2012-2013

The next section is not required for the understanding of the test case. It illustrates the capabilities of JFreeChart as a simple visualization and plotting library.

Plotting data

Although charting is not the primary goal of this book, we thought that you would benefit from a brief introduction to JFreeChart.

Note

Plotting classes

This section illustrates a simple Scala interface to the JFreeChart Java classes. Reading it is not required for the understanding of machine learning. The visualization of the results of a computation is beyond the scope of this book.

Some of the classes used in visualization are described in the Appendix.

The dataset (volatility, volume) is converted into internal JFreeChart data structures.

The following code snippet defines the key components of a simple scatter plot:

class ScatterPlot(config: PlotInfo, theme: PlotTheme) {//16
  def display(xy: Vector[DblPair], width: Int, height: Int): Unit //17

  // ….
}

The class ScatterPlot implements a simple configurable scatter plot with the following arguments:

config: Information, labels, and fonts of the plot
theme: Predefined theme for the plot (black, white background, and so on)

The class PlotTheme defines a specific theme or preconfiguration of the chart (line 16). The class ScatterPlot offers a set of methods with the name display to accommodate a wide range of data structures and configurations (line 17).

Note

Visualization

The JFreeChart library is introduced as a robust charting tool. The code related to plots and charts is omitted throughout the book in order to keep the code snippets concise. On a few occasions, output data is formatted as a CSV file to be imported into a spreadsheet.

Visualizing model features

The ScatterPlot.display method is used to display the normalized input data used in the binomial logistic regression, as follows:

val plot = new ScatterPlot(("CSCO 2012-13 Model features", 
   "Normalized session volatility", "Normalized session Volume"), 
    new BlackPlotTheme)
plot.display(volatilityVolume, 250, 340)

The invocation of the method display generates the following output:

Scatter plot of volatility and volume for Cisco stock 2012-2013

The scatter plot shows some level of correlation between session volume and session volatility and confirms the initial finding in the stock price and volume chart. We can leverage this information to classify trading sessions by their volatility and volume. The next step is to create a two-class model by loading a training set, observations, and expected values into our logistic regression algorithm. The classes are delimited by a decision boundary (also known as a hyperplane) drawn onto the scatter plot.

Visualizing label

The normalized variation of the stock price between the opening and closing of the trading session is selected as the label for this classifier:

Classifier training label: normalized variation of stock price within a trading session

Step 5 – implementing the classifier

The objective of this training is to build a model that can discriminate between volatile and non-volatile trading sessions. For the sake of the exercise, session volatility is defined as the relative difference between a session's highest price and lowest price. The total trading volume within a session constitutes the second parameter of the model. The relative price movement within a trading session (that is, closing price/open price -1) is our expected value or label.

Logistic regression is commonly used in statistical inference.

Note

Logistic regression model (M4)

Given a model with weights w_i, the margin f and the logistic function l are defined as:

$$f(x \mid w) = w_0 + \sum_{i=1}^{n} w_i x_i \qquad l(x \mid w) = \frac{1}{1 + e^{-f(x \mid w)}}$$

The first weight w₀ is known as the intercept. The binomial logistic regression is described in detail in the Logistic regression section in Chapter 9, Regression and Regularization.
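As a worked instance of the model M4, the sketch below evaluates the margin and the logistic function for one two-dimensional observation. The weights and the observation are arbitrary values chosen so that the arithmetic can be followed by hand, and the object name LogisticModel is an assumption.

```scala
object LogisticModel {
  // Margin f(x|w) = w(0) + w(1)*x(0) + w(2)*x(1) + ..., with w(0) as the intercept
  def margin(x: Array[Double], w: Array[Double]): Double =
    w.head + w.tail.zip(x).map { case (wi, xi) => wi * xi }.sum

  // Logistic function l = 1 / (1 + exp(-f))
  def logistic(f: Double): Double = 1.0 / (1.0 + math.exp(-f))

  val w = Array(-1.0, 2.0, 0.5)  // hypothetical weights (intercept first)
  val x = Array(0.25, 1.0)       // hypothetical normalized observation
  val f = margin(x, w)           // -1.0 + 2.0*0.25 + 0.5*1.0 = 0.0
  val p = logistic(f)            // 0.5: the observation lies on the decision boundary
}
```

A positive margin maps to a value above 0.5 and a negative margin to a value below 0.5, which is why the sign of the margin is enough to select the class.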

The following implementation of the binomial logistic regression classifier exposes a single method, classify, to comply with our desire to reduce the complexity and life cycle of objects. The model parameters, weights, are computed during training when the class/model LogBinRegression is instantiated. As mentioned earlier, the sections of the code non-essential to the understanding of the algorithm are omitted.

The constructor LogBinRegression has five arguments (line 18):

observations: Vector of observations representing volume and volatility
expected: A vector of expected values (relative price movement)
maxIters: The maximum number of iterations allowed for the optimizer to extract the regression weights during training
eta: Learning or training rate
eps: The maximum value of the error (predicted – expected) for which the model is valid:

class LogBinRegression(
     observations: Vector[Features], 
     expected: Vector[Double],
     maxIters: Int, 
     eta: Double, 
     eps: Double) {   //18
   val model: LogBinRegressionModel = train         //19
   def classify(obs: Features): Try[(Int, Double)]  //20
   def train: LogBinRegressionModel
   def intercept(weights: Weights): Double
   …
}

The model LogBinRegressionModel is generated through training during the instantiation of the logistic regression class, LogBinRegression (line 19):

case class LogBinRegressionModel(
   weights: Weights, 
   losses: List[Double]
)

The model is fully defined by its weights as described in the mathematical formula M4. The intercept weights(0) represents the mean value of the prediction for observations whose variables are zero. The list losses contains the logistic loss collected at each iteration. It is used for debugging purposes. The intercept does not have a specific meaning in most cases and it is not always computable.

Note

To intercept or not intercept?

The intercept corresponds to the value of weights when the observations have null values. It is a common practice to estimate, whenever possible, the intercept for binomial linear or logistic regressions independently from the slope of the model in the minimization of the error function. The multinomial regression models treat the intercept or weight w0 as part of the regression model, as described in the Ordinary least square regression section of Chapter 9, Regression and Regularization.

The following code snippet implements the computation of the intercept given the weights of a model:

def intercept(weights: Weights): Double = {
  // Observations whose variables are all close to zero
  val zeroObs = obsSet.filter(!_.exists(_ > 0.01))
  if( zeroObs.size > 0)
    -zeroObs.aggregate(0.0)(
      (s,z) => s + dot(z, weights), _ + _
    )/zeroObs.size
  else 0.0
}

The classify method takes new observations as input and computes the index of the classes (0 or 1) that the observations belong to, along with the actual likelihood (line 20).

Selecting an optimizer

The goal of the training of a model using expected values is to compute the optimal weights that minimize the error or loss function.

Note

Least squares or logistic loss

The sum of least squares loss is more often used for regression problems while the logistic loss is more commonly applied to classification.

We select the Stochastic Gradient Descent (SGD) algorithm to minimize the cumulative error between the predicted and expected values for all the observations. Although there are quite a few alternative optimizers, SGD is quite robust and simple enough for this first chapter. The algorithm consists of updating the weights w_i of the regression model so as to minimize the cost.

Note

Cost functions

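M5: Logistic loss for n observations x_i with expected values y_i:

$$L(w) = \frac{1}{n}\sum_{i=1}^{n} \log\left(1 + e^{-y_i f(x_i \mid w)}\right)$$

M6: SGD method to update model weights at iteration t, w_t, with the learning rate η:

$$w_{t+1} = w_t - \eta\,\frac{\partial L}{\partial w}\bigg|_{w = w_t}$$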

For those interested in learning about optimization techniques, the Summary of optimization techniques section in the Appendix presents an overview of the most commonly used optimizers. The stochastic gradient descent is also used for the training of the multilayer perceptron (refer to the subsection The training epoch of the section The multilayer perceptron (MLP) in Chapter 10, Multilayer Perceptron, for more detail).
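The effect of the update rule M6 can be followed on a single observation. The sketch below is a simplified illustration, not the implementation of the book: it assumes labels in {-1, +1}, a learning rate of 0.1, and hypothetical values for the observation, and the names SgdStep, loss, and step are made up for the example.

```scala
object SgdStep {
  import scala.math.{exp, log}

  def margin(x: Array[Double], w: Array[Double]): Double =
    w.head + w.tail.zip(x).map { case (wi, xi) => wi * xi }.sum

  // Logistic loss for one observation, assuming a label y in {-1, +1}
  def loss(x: Array[Double], y: Double, w: Array[Double]): Double =
    log(1.0 + exp(-y * margin(x, w)))

  // One update: w <- w - eta * (gradient of the loss over each weight)
  def step(x: Array[Double], y: Double, w: Array[Double], eta: Double): Array[Double] = {
    val g = -y / (1.0 + exp(y * margin(x, w)))
    val grad = g +: x.map(_ * g)   // the intercept sees a constant input of 1.0
    w.zip(grad).map { case (wi, gi) => wi - eta * gi }
  }

  val x = Array(0.5, 0.2)          // hypothetical observation
  val y = 1.0                      // hypothetical label
  val w0 = Array(0.0, 0.0, 0.0)    // initial weights
  val w1 = step(x, y, w0, eta = 0.1)
  val before = loss(x, y, w0)      // log(2), since the margin is 0
  val after = loss(x, y, w1)       // lower than before for this observation
}
```

A single step only guarantees an improvement on the observation it was computed from, which is why the loss has to be tracked over many iterations.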

The execution of the SGD algorithm follows these steps:

1. Initialize the weights of the regression model.
2. Shuffle the order of the pairs of observations and expected values.
3. Select the first pair of observation and expected value.
4. Compute the loss for this pair.
5. Update the model weights using the derivatives of the loss over each weight.
6. Repeat from step 3 until either the maximum number of iterations is reached or the incremental update of the loss is close to zero.

The purpose of shuffling the order of the observations between iterations is to avoid the minimization of the cost reaching a local minimum.

Note

Batch and SGD

The SGD is a variant of the gradient descent which updates the model weights after computing the error on each observation. Although the SGD requires a higher computation effort to process each observation, it converges toward the optimal value of weights fairly quickly after a small number of iterations. However, the SGD is sensitive to the initial value of the weights and the selection of the learning rate, which is usually defined by an adaptive formula.

Training the model

The training method, train, consists of iterating through the computation of the weights using a simple gradient descent method. The method train computes the weights, collects the logistic loss, losses, at each iteration, and returns an instance of the model LogBinRegressionModel. The code is represented here:

def train: LogBinRegressionModel = {
  val nWeights = observations.head.length + 1  //21
  val init = Array.fill(nWeights)(Random.nextDouble) //22
  val (weights, losses) = sgd(
    0, init, List[Double]()
  )
  new LogBinRegressionModel(weights, losses.reverse)  //23
}

The method train extracts the number of weights, nWeights, for the regression model as the number of variables in each observation + 1 (line 21). The method initializes the weights with random values over [0, 1] (line 22). The weights are computed through the tail recursive method sgd and the method returns a new model for the binomial logistic regression (line 23).

Note

Unwrapping values from Try

It is not usually recommended to invoke the method get on a Try value, unless the call is itself enclosed in a Try block. The best course of action is to do one of the following:

- Catch the failure with match { case Success(m) => … case Failure(e) => … }
- Extract the result safely with getOrElse( /* … */ )
- Propagate the result as a Try type with map( _.m)
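The three options can be compared side by side in a short sketch. The parsing function and the object name TryDemo are hypothetical; only the methods of scala.util.Try are standard.

```scala
object TryDemo {
  import scala.util.{Try, Success, Failure}

  // Hypothetical computation that may fail
  def parse(s: String): Try[Double] = Try(s.toDouble)

  // 1. Pattern matching on both outcomes
  def describe(s: String): String = parse(s) match {
    case Success(v) => s"value $v"
    case Failure(e) => s"failed: ${e.getClass.getSimpleName}"
  }

  // 2. Falling back to a default value
  def orZero(s: String): Double = parse(s).getOrElse(0.0)

  // 3. Staying inside Try and letting the caller decide
  def doubled(s: String): Try[Double] = parse(s).map(_ * 2.0)
}
```

The third form composes best with the for comprehensions used later in this section, because the failure travels along with the value instead of being handled on the spot.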

Let's look at the computation for the weights through the minimization of the loss function in the sgd method:

val shuffled = shuffle(observations.zip(expected)) //24
@tailrec
def sgd(
    nIters: Int,
    weights: Weights, //25
    losses: List[Double]): (Weights, List[Double]) = { //26
  if(nIters >= maxIters)
    (weights, losses)  //27
  else {
    val (x, y) = shuffled(nIters % observations.size)
    val (newLoss, grad) = {
      val yDot = y * margin(x, weights)
      val gradient = derivativeLoss(y, yDot)
      (logisticLoss(yDot),  //28
        Array[Double](gradient) ++ x.map(_ * gradient)) //29
    }

    if(newLoss < eps)  //31
      (weights, newLoss :: losses)  //32
    else {
      val newWeights = weights.zip(grad).map{
        case (w, df) => w - eta*df //33
      }
      sgd(
        nIters + 1, //34
        newWeights,
        newLoss :: losses)
    }
  }
}

The sgd method recurses on the following arguments:

The next labeled observation defined as a pair (observation, label) (line 24)
The current number of iterations, nIters
The model weights computed in the previous recursion (line 25)
The current list of logistic loss values, losses, for debugging purposes (line 26)

Note

SGD implementation

This recursive implementation of SGD is simple and understandable but far from optimized. The different incarnations of SGD are a well-researched and well-documented field [1:12].

The method returns the pair of weights and the list of losses computed at each iteration if the maximum number of iterations allowed for the optimization is reached (line 27). The client code evaluates either the size of the losses list or extracts its head value to validate whether SGD converged.

Note

SGD exit strategies

There are many different possible behaviors when the SGD reaches the maximum allowed number of iterations:

Returns the final weights with a warning or a flag
Throws an exception with a recovery mechanism
Allows more iterations

The formula M5 for the computation of the loss (line 28) and the gradient of the loss over each weight used in formula M6 (line 29) rely on two simple methods: logisticLoss and derivativeLoss. The code is as follows:

def logisticLoss(z: Double): Double =
  log(1.0 + exp(-z)) / observations.size //30
def derivativeLoss(y: Double, yDot: Double): Double =
  -y / (1.0 + exp(yDot))

The logistic loss is normalized by the number of observations (line 30).

The method evaluates the new loss against the convergence criterion eps (line 31) and returns the pair (weights, losses) (line 32) if the SGD converges. The formula M6 that updates the weights is implemented by zipping the weights and the gradient (line 33). The next invocation of sgd selects the next observation in the shuffled sequence of observations, using a modulo operator to avoid indexing past the end of the sequence (line 34).

Finally, here is an example of implementation of the margin formula:

def margin(observation: Features, weights: Weights):Double =
  weights.drop(1).zip(observation.view)
             .aggregate(weights.head)(dot, _ + _)

This implementation of the margin includes the intercept, which is the weight associated with the bias, a feature of constant value 1.0.

Note

Bias value

The purpose of the bias value is to prepend 1.0 to the vector of an observation so that it can be directly processed (that is, zip, dot) with the weights. For instance, a regression model for two-dimensional observations (x, y) has three weights (w₀, w₁, w₂). The bias value, +1, is prepended to the observations to compute the predicted value, 1.0·w₀ + x·w₁ + y·w₂.

This technique is used in the computation of the activation function of the multilayer perceptron as described in The multilayer perceptron (MLP) section in Chapter 10, Multilayer Perceptron.

The sequence of observations is randomly shuffled before the SGD is computed. This implementation of shuffling relies on the Scala standard library method, scala.util.Random.shuffle [1:13].
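The point to keep in mind when shuffling is that observations and expected values have to move together. A minimal sketch, assuming hypothetical data and a fixed seed used only to make the run reproducible:

```scala
object ShuffleDemo {
  import scala.util.Random

  // Hypothetical observations and their expected values
  val observations = Vector(Array(0.1, 0.2), Array(0.3, 0.4), Array(0.5, 0.6))
  val expected = Vector(0.0, 1.0, 0.0)

  // Zip first, so that each observation keeps its own expected value
  val rng = new Random(42L)
  val shuffled: Vector[(Array[Double], Double)] =
    rng.shuffle(observations.zip(expected))
}
```

Shuffling the two vectors independently would silently pair observations with the wrong labels, so the zip has to come before the shuffle.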

Note

Fisher-Yates shuffling

The Training and classification subsection in the The multilayer perceptron (MLP) section of Chapter 10, Multilayer Perceptron, describes an alternative and efficient shuffling algorithm.

In order to train the model, we need to label input data. The labeling process consists of associating the relative price movement during a session (price at close/price at open – 1) with one of two configurations:

Volatile trading sessions with high trading volume
Trading sessions with low volatility and low trading volume

In this particular case, the labeling is automated because the relative price movement is extractable from raw data.

Note

Automated labelling

Although quite convenient, the automated creation of training labels is not without risk, as it may mislabel singular observations. This technique is used in our test for convenience; it is not recommended without a domain expert manually labeling observations.

Classifying observations

Once the model has been successfully created through training, it is available to classify new observations. The runtime classification of observations using the binomial logistic regression is implemented by the method classify:

def classify(obs: Features): Try[(Int, Double)] = Try {
  val linear = margin(obs, model.weights) +
               model.weights(0)  //37
  val prediction = sigmoid(linear)
  (if(linear > 0.0) 1 else 0, prediction) //38
}

The method applies the logistic function to the linear inner product, linear, of the new observation, obs, and the weights of the model (line 37). The method returns the tuple (the predicted class of the observation {0, 1}, prediction value), where the class is defined by comparing the linear inner product to the boundary value 0.0, which corresponds to a prediction value of 0.5 (line 38).

The computation of the margin as product of weights and observations is as follows:

def margin(obs: Features, weights: Weights): Double =
  weights.drop(1).zip(obs.view)
         .aggregate(0.0)({ case (s, (w, x)) => s + w*x }, _ + _)

The margin method is used in the classify method.

Step 6 – evaluating the model

The first step is to define the configuration parameters for the test: the maximum number of iterations, NITERS, convergence criterion EPS, learning rate ETA, and decision boundary used to label training observations, BOUNDARY, and the path to the training and test sets:

val NITERS = 4096; val EPS = 0.001; val ETA = 0.0001
val path_training = "supervised/regression/CSCO.csv"
val path_test = "supervised/regression/CSCO2.csv"

The various activities of creating and testing the model (loading and normalizing the data, training the model, then loading and classifying the test data) are organized as a workflow using the monadic composition of the Try class:

for {
   path <- getPath(path_training)
   (volatility, vol) <- load(path)  //39
   minMaxVec <- Try(new MinMaxVector(volatility))  //40
   normVolatilityVol <- Try(minMaxVec.normalize(0.0, 1.0))  //41
   classifier <- logRegr(normVolatilityVol, vol)  //42

   testValues <- load(path_test)  //43
   normTestValue0 <- minMaxVec.normalize(testValues._1(0))  //44
   class0 <- classifier.classify(normTestValue0)  //45
   normTestValue1 <- minMaxVec.normalize(testValues._1(1))
   class1 <- classifier.classify(normTestValue1)
} yield {
   val modelStr = model.toString
}

First, the daily trading volatility and volume of the stock, the (volatility, vol) pairs, are loaded from file (line 39). The workflow initializes the multi-dimensional normalizer, minMaxVec (line 40), and uses it to normalize the training set (line 41). The logRegr method instantiates the binomial logistic regression, classifier (line 42). The test data, testValues, is loaded from file (line 43), normalized using minMaxVec, which has already been applied to the training data (line 44), and classified (line 45).
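The key property of this monadic composition is that the first failing step short-circuits the rest of the workflow. The sketch below illustrates it with two hypothetical stand-ins, parse and normalize, which are not the book's load or MinMaxVector; the object name and input format are assumptions made for the example.

```scala
import scala.util.Try

object TryWorkflowSketch {
  // Hypothetical loader: parses comma-separated numbers
  def parse(raw: String): Try[Vector[Double]] =
    Try(raw.split(",").map(_.trim.toDouble).toVector)

  // Hypothetical min-max normalization into [0, 1]
  def normalize(xs: Vector[Double]): Try[Vector[Double]] = Try {
    val (lo, hi) = (xs.min, xs.max)
    require(hi > lo, "cannot normalize a constant series")
    xs.map(x => (x - lo) / (hi - lo))
  }

  // Each generator runs only if the previous one succeeded
  def run(raw: String): Try[Vector[Double]] = for {
    values <- parse(raw)
    normalized <- normalize(values)
  } yield normalized
}
```

Calling TryWorkflowSketch.run("1,2,3") yields a Success, whereas run("1,x,3") yields a Failure from parse and normalize is never invoked. The caller handles the error once, at the end of the workflow, instead of after every step.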

The method load extracts the data from the file as a pair of vectors of type Labels, wrapped in a Try. The heavy lifting is done by the extract method (line 46), and then the file handle is closed (line 47) before returning the raw observations:

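import scala.io.Source
import scala.util.Try

type Labels = (Vector[Features], Vector[Double])

def load(fileName: String): Try[Labels] =  {
  val src = Source.fromFile(fileName)
  // Split each line into its CSV fields and skip the header row
  val data = extract(src.getLines.map( _.split(",")).drop(1)) //46
  src.close; data //47
}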
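In the load method, the file handle stays open if the extraction throws an exception. One possible variant, sketched below under stated assumptions, wraps the read in try/finally inside a Try so the source is always closed. The name loadRows is hypothetical and the sketch returns the raw CSV fields instead of calling extract.

```scala
import scala.io.Source
import scala.util.Try

object SafeLoadSketch {
  // Hypothetical helper: returns the CSV fields of every row but the header
  def loadRows(fileName: String): Try[Vector[Array[String]]] = Try {
    val src = Source.fromFile(fileName)
    try {
      // toVector forces the lazy iterator before the source is closed
      src.getLines().drop(1).map(_.split(",")).toVector
    } finally src.close()
  }
}
```

A missing file or a read error surfaces as a Failure, which composes with the other steps of the workflow in the same way as the original load.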

The method logRegr, implemented in the following code snippet, has two purposes:

Automatically labeling the observations, obs, by normalizing the real values (line 48)

Instantiating and training the binomial logistic regression model (line 49):

def logRegr(x: Labels): Try[LogBinRegression] = Try {
  val (obs, real) = x
  // The normalized real values serve as the training labels
  val normReal = normalize(real)
                 .getOrElse(Vector.empty[Double])  //48
  new LogBinRegression(obs, normReal, NITERS, ETA, EPS) //49
}

Note

Validation

The simple classification in this test case is provided to illustrate the runtime application of the model. It does not constitute a validation of the model by any stretch of the imagination. The next chapter digs into validation methodologies (refer to the Assessing a model section of Chapter 2, Data Pipelines, for more detail).

The training run is performed with three different values of the learning rate. The following chart illustrates the convergence of the batch gradient descent in the minimization of the cost given different values of learning rates:

Impact of the learning rate on the convergence of the loss

As expected, the execution of the optimizer with the highest learning rate produces the steepest descent in the cost function.
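The effect of the learning rate can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions, not the book's optimizer: it runs a fixed number of batch gradient descent iterations on a single-feature logistic regression, starts from zero weights instead of random ones, and omits the EPS convergence test. All names and the data set are hypothetical.

```scala
object LearningRateSketch {
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // Mean cross-entropy loss for the weights (intercept, slope)
  def cost(xs: Vector[Double], ys: Vector[Double], w: (Double, Double)): Double =
    xs.zip(ys).map { case (x, y) =>
      val p = sigmoid(w._1 + w._2 * x)
      -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    }.sum / xs.size

  // Batch gradient descent: every iteration uses the whole data set
  def train(xs: Vector[Double], ys: Vector[Double],
            eta: Double, iters: Int): (Double, Double) =
    (0 until iters).foldLeft((0.0, 0.0)) { case ((b, w1), _) =>
      val errs = xs.zip(ys).map { case (x, y) => (sigmoid(b + w1 * x) - y, x) }
      val gradB = errs.map(_._1).sum / xs.size
      val gradW = errs.map { case (e, x) => e * x }.sum / xs.size
      (b - eta * gradB, w1 - eta * gradW)
    }

  // Hypothetical, linearly separable toy data
  val xs = Vector(-2.0, -1.0, 1.0, 2.0)
  val ys = Vector(0.0, 0.0, 1.0, 1.0)

  // Cost reached after the same number of iterations for a given rate
  def costAfter(eta: Double, iters: Int = 100): Double =
    cost(xs, ys, train(xs, ys, eta, iters))
}
```

On this data set, costAfter(0.1) is lower than costAfter(0.01), and both are below the initial cost of ln 2. This holds only while the learning rate remains small enough for the descent to be stable; a rate that is too large can make the cost oscillate or diverge.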

The execution of the test produces the following model:

iters = 495
weights: 0.859,-3.617,-6.927
input (0.0088, 4.10E7) normalized (0.063,0.061) class 1 prediction 0.5156
input (0.0694, 3.68E8) normalized (0.517,0.641) class 0 prediction 0.0012

These values may differ between experiments as the initial weights of the model are initialized randomly.

Note

Learning more about regressive models

The binomial logistic regression is merely used to illustrate the concept of training and prediction. It is described in detail in the Logistic regression section in Chapter 9, Regularization and Regression.
