Scala for Machine Learning, Second Edition
Overview of this book

The discovery of information through data clustering and classification is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, engineering design, logistics, manufacturing, and trading strategies, to the detection of genetic anomalies. This book is your one-stop guide to the functional capabilities of the Scala programming language that are critical to the creation of machine learning algorithms, such as dependency injection and implicits. You start by learning data preprocessing and filtering techniques. Following this, you'll move on to unsupervised learning techniques such as clustering and dimension reduction, followed by probabilistic graphical models such as Naïve Bayes, hidden Markov models, and Monte Carlo inference. Further, the book covers discriminative algorithms such as linear and logistic regression with regularization, kernelization, support vector machines, neural networks, and deep learning. You'll then move on to evolutionary computing, multi-armed bandit algorithms, and reinforcement learning. Finally, the book includes a comprehensive overview of parallel computing in Scala and Akka, followed by a description of Apache Spark and its ML library. With updated code based on the latest version of Scala and comprehensive examples, this book will ensure that you have more than just a solid fundamental knowledge of machine learning with Scala.

Scala programming


Here is a partial list of coding practices and design techniques used throughout the book.

List of libraries and tools

The precompiled Scala for Machine Learning library is ScalaMl-2.11-0.99.2.jar, located in the directory $ROOT/project/target/scala-2.11.

Here is the complete list of recommended tools and libraries used in creating and running the examples in Scala for Machine Learning:

  • Java JDK 1.7 or 1.8 for all chapters

  • Scala 2.11.8 or higher for all chapters

  • Scala IDE for Eclipse 4.0 or higher

  • IntelliJ IDEA Scala plugin 13.0 or higher

  • sbt 0.13.1 or higher

  • Apache Commons Math 3.5 or higher for Chapter 3, Data Pre-processing, Chapter 4, Unsupervised Learning, and Chapter 12, Kernel Models and SVM

  • JFreeChart 1.0.1 in Chapter 1, Getting Started, Chapter 2, Data Pipelines, Chapter 5, Dimension Reduction, and Chapter 9, Regression and Regularization

  • IITB CRF 0.2 (including the LBFGS and Colt libraries) in Chapter 7, Sequential Data Models

  • LIBSVM 0.1.6 in Chapter 8, Monte Carlo Inference

  • Akka framework 2.3 or higher in Chapter 16, Parallelism in Scala and Akka

  • Apache Spark/MLlib 2.0 or higher in Chapter 17, Apache Spark MLlib

  • Apache Maven 3.5 or higher (required for Apache Spark 2.0 or higher)

Note

Note for Spark developers

The assembly JAR for Apache Spark bundles a version of the Scala standard library and compiler JAR that may conflict with an existing Scala library (for example, the default Eclipse ScalaIDE library).

The lib directory contains the following JAR files for the third-party libraries and frameworks used in the book: Colt, CRF, LBFGS, and LIBSVM.

Code snippets format

For the sake of readability of the implementation of algorithms, all nonessential code such as error checking, comments, exceptions, or imports is omitted. The following code elements are discarded in the code snippets presented in the book:

The following are the comments:

/**
This class is defined as …
*/
// MathRuntime exception has to be caught here!

The validation of class parameters and method arguments is as follows:

class Columns(cols: List[Int], …) {
  require(cols.size > 0, "Cols is empty")

The code for class qualifiers such as final and private is as follows:

final protected class MLP[T: ToDouble] …

The code for method qualifiers and access control (final and private) is as follows:

final def inputLayer: MLPLayer
private def recurse: Unit
private[this] val eps = 1e-6

The code for serialization is as follows:

class Config extends Serializable {…}

The code for validation of partial functions is as follows:

val pfn: PartialFunction[U, V]
pfn.isDefinedAt(u)
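
Here is a brief, self-contained illustration of this validation pattern (the partial function and values are hypothetical):

// Guard the invocation with isDefinedAt to avoid a MatchError
val pfn: PartialFunction[Int, Double] = { case n if n != 0 => 1.0/n }
val result = if (pfn.isDefinedAt(4)) pfn(4) else Double.NaN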

The validation of intermediate state is as follows:

assert( p != None, " … ")

The following are Java-style exceptions:

try { … }
catch { case e: ArrayIndexOutOfBoundsException  => … }
if (y < EPS)
   throw new IllegalStateException( … )

The following are Scala-style exceptions:

Try(process(args)) match {
   case Success(results) => …
   case Failure(e) => …
}

The following are nonessential annotations:

@inline def mean =  … 
@implicitNotFound("Conversion $T to Array[Int] undefined")
@throws(classOf[IllegalStateException])

The following is the logging and debugging code:

m_logger.debug( …)
Console.println(…)

Auxiliary and nonessential methods are also omitted.

Best practices

Let's walk through these practices in detail.

Encapsulation

One important objective in creating an API is to restrict access to the supporting or helper classes. There are two options to encapsulate helper classes:

  • Package scope: The supporting classes are first-level classes with protected access

  • Class or object scope: The supporting classes are nested in the main class

The algorithms presented in the book follow the first encapsulation pattern.
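
Here is a minimal sketch of the first pattern, using hypothetical names: the helper class is package-private, so it supports the public API without being exposed to client code.

package supervised {
  // Visible only within the supervised package
  private[supervised] class Helper {
    def scale(x: Array[Double]): Array[Double] = x.map(_ / x.length)
  }

  final class Classifier {
    private[this] val helper = new Helper
    def predict(x: Array[Double]): Double = helper.scale(x).sum
  }
}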

Class constructor template

The constructors of a class are defined in the companion object using apply, and the class itself has package scope (protected):

protected class A[T](val x: X, val y: Y,…) { … } 
object A {
  def apply[T](x: X, y: Y, ...): A[T] = new A(x, y,…)
  def apply[T](x: X, …): A[T] = new A(x, y0, …) // y0: default value
}

For example, the SVM class that implements the support vector machine is defined as follows:

final protected class SVM[T: ToDouble](
    config: SVMConfig, 
    xt: Vector[Array[T]], 
    expected: DblVec)
  extends ITransform[Array[T], Array[Double]] {

The companion object, SVM, is responsible for defining all the constructors (instance factories) relevant to the protected class SVM:

def apply[T: ToDouble](
    config: SVMConfig, 
    xt: Vector[Array[T]], 
    expected: DblVec): SVM[T] = 
new SVM[T](config, xt, expected)

Companion objects versus case classes

In the preceding example, the constructors are explicitly defined in the companion object. Although the invocation of the constructor is very similar to the instantiation of a case class, there is a major difference: the Scala compiler generates several methods to manipulate a case class instance as regular data (equals, copy, and hashCode).

Case classes should be reserved for single-state data objects (no methods).
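
For instance, the following standard behavior comes for free with any case class (Point is a hypothetical example):

case class Point(x: Double, y: Double)

val p1 = Point(1.0, 2.0)   // instantiation through the generated apply
val p2 = p1.copy(y = 3.0)  // generated copy method
p1 == Point(1.0, 2.0)      // true: structural equality through equals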

Enumerations versus case classes

It is quite common to read or hear discussions regarding the relative merits of enumerations and pattern matching with case classes in Scala. As a very general guideline, enumeration values can be regarded as lightweight case classes, or case classes can be considered heavyweight enumeration values.

Let's take an example of a Scala enumeration that consists of evaluating the uniform distribution of scala.util.Random:

object A extends Enumeration {
  type TA = Value
  val A, B, C = Value
}

import A._
import scala.util.Random.nextInt

val counters = Array.fill(A.maxId+1)(0)
(0 until 1000).foreach( _ => nextInt(10) match {
  case 3 => counters(A.id) += 1
  …
  case _ => …
})

The pattern matching is very similar to the Java switch statement.

Let's consider the following example of pattern matching using case classes that select a mathematical formula according to the input:

package AA {
  sealed abstract class A(val level: Int)
  case class AA() extends A(3) { def f = (x: Double) => 23*x }
  …
}

import AA._
def compute(a: A, x: Double): Double = a match {
   case a: AA => a.f(x)
   …
}

The pattern matching is performed using the default equals method, whose byte code is automatically generated for each case class. This approach is far more flexible than the simple enumeration, at the cost of extra computation cycles.

The advantages of using enumerations over case classes are as follows:

  • Enumerations involve less code for a single attribute comparison

  • Enumerations are more readable, especially for Java developers

The advantages of using case classes are as follows:

  • Case classes are data objects and support more attributes than enumeration IDs

  • Pattern matching is optimized for sealed classes as the Scala compiler is aware of the number of cases

Briefly, you should use enumerations for single-value constants and case classes for matching data objects.

Overloading

Contrary to C++, Scala does not actually overload operators; operators in Scala are ordinary methods. Here is the meaning of the very few operators used in code snippets, illustrated right after this list:

  • += adds an element to a collection or container

  • + sums two elements of the same type
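
The following snippet shows both operators in context; it uses only the standard library and is illustrative, not book code:

import scala.collection.mutable.ArrayBuffer

val buffer = ArrayBuffer[Double](1.0)
buffer += 2.0           // += adds an element to the container
val total = 1.0 + 2.0   // + sums two elements of the same type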

Design template for immutable classifiers

The machine learning algorithms described in Scala for Machine Learning use the following design pattern and components:

  • The set of configuration and tuning parameters for the classifier is defined in a class inheriting from Config (that is, SVMConfig).

  • The classifier implements a monadic data transformation of type ITransform for which the model is implicitly generated from a training set (that is, SVM[T]). The classifier requires at least three parameters, which are as follows:

    • A configuration for the execution of the training and classification tasks

    • An input data set xt of type Vector[T]

    • A vector of labels or expected values

  • A model whose type inherits from Model. The constructor is responsible for creating the model through training (that is, SVMModel)

For example, the key components of the support vector machine package are the classifier SVM:

final protected class SVM[T: ToDouble](
    config: SVMConfig, 
    xt:Vector[Array[T]], 
    val labels: DblVec)
  extends ITransform[Array[T], Array[Double]] {

  val model: Option[SVMModel] = … // created through training
  override def |> : PartialFunction[Array[T], Array[Double]]
  …
}

The training set is created by combining or zipping the input dataset xt with the vector of labels or expected values. Once trained and validated, the model is available for prediction or classification.

The design has the main advantage of simplifying the lifecycle of a classifier: a model is either defined and available for classification, or not created at all. The configuration and model classes are implemented as follows:

case class SVMConfig(
   formulation: SVMFormulation, 
   kernel: SVMKernel, 
   svmExec: SVMExecution) extends Config

class SVMModel(val svmmodel: svm_model) extends Model
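
Since the model is exposed as an Option[SVMModel], client code can enforce this lifecycle with a simple pattern match, as sketched here:

svm.model match {
  case Some(_) => // model created and validated: ready for classification
  case None => // training failed: no partially initialized classifier exists
}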

Note

Implementation considerations

The validation phase is omitted in most of the practical examples throughout the book for the sake of readability.

Utility classes

Let's look at the utility classes in detail.

Data extraction

The CSV file is the most common format used to store historical financial data. It is the default format for importing data used throughout the book. The data source relies on the DataSourceConfig configuration class as follows:

case class DataSourceConfig(
  path: String, 
  normalize: Boolean, 
  reverseOrder: Boolean, 
  headerLines: Int = 1)

The parameters for the DataSourceConfig class are as follows:

  • path: The relative pathname of a data file to be loaded, if the argument is a file, or of the directory containing multiple input data files. Most files used in the book are CSV files

  • normalize: Flag to specify whether the data must be normalized over [0, 1]

  • reverseOrder: Flag to specify whether the order of the data in the file should be reversed (that is, for time series)

  • headerLines: Number of lines for the column headers and comments
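
For example, a hypothetical configuration for loading a normalized, time-ordered CSV file with a single header line could be written as follows:

val config = DataSourceConfig(
  path = "resources/data/CSCO.csv", // assumed location
  normalize = true,                 // rescale values over [0, 1]
  reverseOrder = true)              // headerLines defaults to 1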

The data source, DataSource, implements a data transformation of type ETransform with an explicit configuration, DataSourceConfig, as described in the Monadic data transformation section of Chapter 2, Data Pipelines:

type Fields = Array[String]
type U = List[Fields => Double]
type V = Vector[Array[Double]]

final class DataSource(
  config: DataSourceConfig,
  srcFilter: Option[Fields => Boolean] = None)
extends ETransform[U, V](config) {

  override def |> : PartialFunction[U, Try[V]] 
  ...
}

The srcFilter argument specifies a filter or condition on some of the row fields, used to skip records in the dataset (that is, rows with missing data or an incorrect format). Being an explicit data transformation, the constructor for the DataSource class has to initialize the input type U and the output type V of the extracting method |>. The method applies an extractor that converts a row of literal values to double floating-point values:

override def |> : PartialFunction[U, Try[V]] = {
  case fields: U if fields.nonEmpty => load.map(data => { //1
    val convert = (f: Fields => Double) => data._2.map(f(_))

    if (config.normalize) //2
      fields.map(t => new MinMax[Double](convert(t)) //3
          .normalize(0.0, 1.0).toArray).toVector //4
    else fields.map(convert(_)).toVector
  })
}

The data is loaded from the file using the helper method load (line 1). The data is normalized if required (line 2) by converting each literal to a floating-point value using an instance of the MinMax class (line 3). Finally, the MinMax instance normalizes the sequence of floating point values (line 4).
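
For reference, here is a minimal sketch of what a MinMax-style normalizer might look like; it follows the semantics described above and is an assumption, not the book's actual implementation:

class MinMax(values: Seq[Double]) {
  private[this] val min = values.min
  private[this] val max = values.max
  // Linear rescaling of each value to the range [low, high]
  def normalize(low: Double, high: Double): Seq[Double] =
    values.map(x => low + (x - min)*(high - low)/(max - min))
}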

The DataSource class implements a significant set of methods that are documented in the source code available online.
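
As an illustration of the srcFilter argument, here is a hypothetical filter that skips any row whose sixth field (for example, the volume) is zero:

val srcFilter = Some((fields: Fields) => fields(5).toDouble > 0.0)
val ds = new DataSource(config, srcFilter)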

Financial data sources

The examples in the book rely on three different sources of financial data using the CSV format as follows:

  • YahooFinancials: This is used for the Yahoo schema for historical stock and ETF prices

  • GoogleFinancials: This is used for the Google schema for historical stock and ETF prices

  • Fundamentals: This is used for fundamental financial analysis ratios (CSV file)

Let's illustrate the extraction from a data source using YahooFinancials as an example:

object YahooFinancials extends Enumeration {
   type YahooFinancials = Value

   val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value
   val adjClose = ((s:Array[String]) =>
        s(ADJ_CLOSE.id).toDouble)  //5
   val volume = (s: Fields) => s(VOLUME.id).toDouble
   …
   def toDouble(value: Value): Array[String] => Double = 
       (s: Array[String]) => s(value.id).toDouble
}

Let's look at an example of the application of the DataSource transformation: loading historical stock data from the Yahoo Finance site. The data is downloaded as a CSV-formatted file. Each column is associated with an extractor function (line 5):

val symbols = Array[String]("CSCO", ...)  //6
val prices = symbols
       .map(s => DataSource(s"$path$s.csv",true,true,1))//7
       .map( _ |> adjClose ) //8

The list of stocks for which the historical data has to be downloaded is defined as an array of symbols (line 6). Each symbol is associated with a CSV file (that is, CSCO => resources/CSCO.csv) (line 7). Finally, the YahooFinancials extractor for the adjClose price is invoked (line 8).

The format of the financial data extracted from the Google financial pages is similar to the format used with the Yahoo financial pages:

object GoogleFinancials extends Enumeration {
   type GoogleFinancials = Value
   val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME = Value
   val close = (s: Array[String]) => s(CLOSE.id).toDouble //5
   …
}

The YahooFinancials, GoogleFinancials, and Fundamentals classes implement a significant number of methods that are documented in the source code available online.

Documents extraction

The DocumentsSource class is responsible for extracting the date, title, and content of a list of text documents or text files. The class does not support HTML documents. The DocumentsSource class implements a monadic data transformation of type ETransform with an explicit configuration of type SimpleDateFormat:

type U = Option[Long] //2
type V = Corpus[Long] //3
class DocumentsSource(
  dateFormat: SimpleDateFormat,
  val pathName: String) //1
extends ETransform[U, V](dateFormat) {

  override def |> : PartialFunction[U, Try[V]] = { //4
    case date: U if filesList.isDefined => 
      Try( if(date.isEmpty) getAll else get(date) )
  }
 
  def get(t: U): V = getAll.filter( _.date == t.get)
  def getAll: V //5
 ...
}

The DocumentsSource class takes two arguments: the format of the date associated with each document, and the name of the path where the documents are located (line 1). Being an explicit data transformation, the constructor for the DocumentsSource class has to initialize the input type U (line 2), a date converted to Long, and the output type V (line 3), a Corpus, for the extracting method |>.

The extractor |> generates a corpus associated with a specific date converted to the Long type (line 4). The getAll method does the heavy lifting of extracting and sorting the documents (line 5).

The implementation of the getAll method as well as other methods of the DocumentsSource class is described in the documented source code available online.
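
Here is a hypothetical usage sketch: retrieve the entire corpus by passing None, or the documents associated with a single date encoded as Long (the date value below is illustrative):

import java.text.SimpleDateFormat

val src = new DocumentsSource(
  new SimpleDateFormat("MM.dd.yyyy"), "resources/text/")
val allDocs = src |> None            // Try[Corpus[Long]] over all documents
val oneDay = src |> Some(20160315L)  // documents for one Long-encoded date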

DMatrix class

Some of the discriminative learning models require operations performed on the rows and columns of a matrix. The DMatrix class facilitates read and write operations on columns and rows:

class DMatrix(
  val nRows: Int, 
  val nCols: Int, 
  val data: Array[Double]) {

  def apply(i: Int, j: Int): Double = data(i*nCols+j)
  def row(iRow: Int): Array[Double] = { 
    val idx = iRow*nCols
    data.slice(idx, idx + nCols)
  }

  def col(iCol: Int): IndexedSeq[Double] =
      (iCol until data.size by nCols).map( data(_) )
  def diagonal: IndexedSeq[Double] = 
      (0 until data.size by nCols+1).map( data(_))
  def trace: Double = diagonal.sum
  …
}

The apply method returns an element of the matrix. The row method returns a row as an array, and the col method returns the indexed sequence of the elements of a column. The diagonal method returns the indexed sequence of diagonal elements, and the trace method sums the diagonal elements.
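
A quick sanity check of these accessors on a 2×2 matrix with hypothetical values:

val m = new DMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0))
m(0, 1)   // 2.0: element at row 0, column 1
m.row(1)  // Array(3.0, 4.0)
m.col(0)  // IndexedSeq(1.0, 3.0)
m.trace   // 5.0 = 1.0 + 4.0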

The DMatrix class supports normalization of elements, rows, and columns; transposition; and updates of elements, columns, and rows. The DMatrix class implements a significant number of methods that are documented in the source code available online.

Counter

The Counter class implements a generic mutable counter whose key is a parameterized type. The number of occurrences of each key is managed by a mutable hash map, as follows:

class Counter[T] extends HashMap[T, Int] {
  def += (t: T): this.type = { super.put(t, getOrElse(t, 0)+1); this }
  def + (t: T): Counter[T] = { 
    super.put(t, getOrElse(t, 0)+1); this 
  }
  def ++ (cnt: Counter[T]): this.type = { 
    cnt./:(this)((c, t) => c + t._1); this
  }
  def / (cnt: Counter[T]): HashMap[T, Double] = map { 
    case (str, n) => (str, 
      if (!cnt.contains(str)) 
        throw new IllegalStateException(" ... ")
      else n.toDouble/cnt.get(str).get)
  }
  …
}

The += operator updates the count for key t and returns the counter itself. The + operator updates the count and then returns the updated counter. The ++ operator merges the content of another counter into this counter. The / operator divides the count for each key by the count of the same key in another counter.

The Counter class implements a significant set of methods that are documented in the source code available online.
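
A short, hypothetical usage example that counts occurrences of string keys:

val counter = new Counter[String]
List("a", "b", "a").foreach(counter += _)
counter.get("a")  // Some(2)
counter.get("b")  // Some(1)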

Monitor

The Monitor class has two purposes:

  • Log information and error messages using the methods show and error

  • Collect and display variables related to the recursive or iterative execution of an algorithm

The data is collected at each iteration or recursion and then displayed as a time series with the iteration number as the x-axis value, as shown in the following code:

trait Monitor[T] {
  protected val logger: Logger
  lazy val _counters = HashMap[String, ArrayBuffer[T]]()

  def counters(key: String): Option[ArrayBuffer[T]]
  def count(key: String, value: T): Unit 
  def display(key: String, legend: Legend)
      (implicit f: T => Double): Boolean
  def show(msg: String): Int = show(msg, logger)
  def error(msg: String): Int = error(msg, logger)
  ...
}

The counters method retrieves the buffer of values associated with a specific key. The count method updates the data associated with a key. The display method plots the time series. Finally, the show and error methods send information and error messages to the standard output.
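
Here is a hypothetical sketch of a learner mixing in Monitor to trace convergence; it assumes the trait's concrete implementations from the book's source and Apache Log4j for the logger:

import org.apache.log4j.Logger

class GradientTrainer extends Monitor[Double] {
  protected val logger: Logger = Logger.getLogger("GradientTrainer")

  def train(errors: Seq[Double]): Unit = {
    errors.foreach(count("mse", _))  // one data point per iteration
    show(s"Converged after ${errors.size} iterations")
  }
}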

The documented source code for the implementation of the Monitor class is available online.