
Scala programming


Here is a partial list of the libraries, coding practices, design techniques, and utility classes used throughout the book.

List of libraries and tools

The precompiled Scala for Machine Learning code is available as ScalaMl-2.11-0.99.jar in the $ROOT/project/target/scala-2.11 directory. Not all libraries are needed for every chapter. The list is as follows:

  • Java JDK 1.7 or 1.8 is required for all chapters

  • Scala 2.10.4 or higher is required for all chapters

  • Scala IDE for Eclipse 4.0 or higher

  • IntelliJ IDEA Scala plugin 13.0 or higher

  • sbt 0.13 or higher

  • Apache Commons Math 3.5+ is required for Chapter 3, Data Preprocessing, Chapter 4, Unsupervised Learning, and Chapter 6, Regression and Regularization

  • JFreeChart 1.0.7 is required for Chapter 1, Getting Started, Chapter 2, Hello World!, Chapter 5, Naïve Bayes Classifiers, and Chapter 9, Artificial Neural Networks

  • IITB CRF 0.2 (including the LBFGS and Colt libraries) is required for Chapter 7, Sequential Data Models

  • LIBSVM 0.1.6 is required for Chapter 8, Kernel Models and Support Vector Machines

  • Akka framework 2.2 or higher is required for Chapter 12, Scalable Frameworks

  • Apache Spark/MLlib 1.3 or higher is required for Chapter 12, Scalable Frameworks

  • Apache Maven 3.3 or higher (required for Apache Spark 1.4 or higher)

Note

A note for Spark developers

The Scala standard library and compiler JAR files bundled with the Apache Spark assembly JAR may conflict with an existing Scala library (for example, the default Eclipse Scala IDE library).

The lib directory contains the JAR files of the third-party libraries and frameworks used in the book: colt, CRF, LBFGS, and LIBSVM.

Code snippets format

For the sake of readability, all nonessential code such as error checking, comments, exception handling, and imports has been omitted from the implementation of the algorithms. The following code elements are discarded in the code snippets presented in the book:

  • Comments:

    /**
    This class is defined as …
    */
    // The MathRuntime exception has to be caught here!
  • Validation of class parameters and method arguments:

    class BaumWelchEM(val lambda: HMMLambda, …) {
      require(lambda != null, "Lambda model is undefined")
  • Class qualifiers such as final and private:

    final protected class MLP[T <% Double] …
  • Method qualifiers and access control (final, private, and so on):

    final def inputLayer: MLPLayer
    private def recurse: Unit =
  • Serialization:

    class Config extends Serializable { … }
  • Validation of partial functions:

    val pfn: PartialFunction[U, V]
    pfn.isDefinedAt(u)
  • Validation of intermediate states:

    assert( p != None, " … ")
  • Java style exceptions:

    try { … }
    catch { case e: ArrayIndexOutOfBoundsException  => … }
    if (y < EPS)
       throw new IllegalStateException( … )
  • Scala style exceptions:

    Try(process(args)) match {
       case Success(results) => …
       case Failure(e) => …
    }
  • Nonessential annotations:

    @inline def mean = { … }
    @implicitNotFound("Conversion ${T} to Array[Int] undefined")
    @throws(classOf[IllegalStateException])
  • Logging and debugging code:

    m_logger.debug( …)
    Console.println( … )
  • Auxiliary and nonessential methods

Best practices

Encapsulation

One important objective while creating an API is to restrict access to supporting (helper) classes. There are two options to encapsulate helper classes, as follows:

  • A package scope: The supporting classes are first-level classes with protected access

  • A class or object scope: The supporting classes are nested in the main class or object

The algorithms presented in this book follow the first encapsulation pattern.
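
For instance, here is a minimal sketch of the package scope option (the package and class names are hypothetical), using a package-qualified access modifier to restrict a first-level helper class to its enclosing package:

package mlearning

// Public entry point of the package
final class Classifier { /* … */ }

// First-level helper class visible only within the mlearning package
private[mlearning] class CostFunction { /* … */ }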

Class constructor template

The constructors of a class are defined in the companion object using apply and the class has a package scope (protected):

protected class A[T](val x: X, val y: Y, …) { … } 
object A {
  def apply[T](x: X, y: Y, …): A[T] = new A(x, y, …)
  def apply[T](x: X): A[T] = new A(x, y0, …) // y0: default value for y
}

For example, the SVM class that implements the support vector machine is defined as follows:

final protected class SVM[T <: AnyVal](
    config: SVMConfig, 
    xt: XVSeries[T], 
    expected: DblVector)(implicit f: T => Double) 
  extends ITransform[Array[T]](xt) {

The SVM companion object is responsible for defining all the constructors (instance factories) relevant to the SVM protected class:

def apply[T <: AnyVal](
    config: SVMConfig, 
    xt: XVSeries[T], 
    expected: DblVector)(implicit f: T => Double): SVM[T] = 
  new SVM[T](config, xt, expected)

Companion objects versus case classes

In the preceding example, the constructors are explicitly defined in the companion object. Although the invocation of the constructor is very similar to the instantiation of a case class, there is a major difference: the Scala compiler generates several methods to manipulate an instance of a case class as regular data (equals, copy, hashCode, and so on).

Case classes should be reserved for single-state data objects (no methods).
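
As a hedged illustration (the Centroid and Point classes are hypothetical), the two approaches compare as follows:

// Constructor defined explicitly in a companion object
class Centroid(val values: Vector[Double]) { /* methods, state transitions, … */ }
object Centroid {
  def apply(values: Vector[Double]): Centroid = new Centroid(values)
}

// Case class: the compiler generates equals, hashCode, copy, and so on
case class Point(x: Double, y: Double)

val p = Point(1.0, 2.0)
val q = p.copy(y = 3.0) // generated copy method
assert(p != q) // structural equality through the generated equals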

Enumerations versus case classes

It is quite common to read or hear discussions regarding the relative merit of enumerations and pattern matching with case classes in Scala [A:1]. As a very general guideline, enumeration values can be regarded as lightweight case classes, or case classes can be regarded as heavyweight enumeration values.

Let's take an example of a Scala enumeration used to tally the values generated by the uniform random generator of the scala.util.Random library:

import scala.util.Random

object Tags extends Enumeration {
  type Tags = Value
  val A, B, C = Value
}

import Tags._
val counters = Array.fill(Tags.maxId)(0)
Range(0, 1000).foreach( _ => Random.nextInt(10) match {
  case 3 => counters(A.id) += 1
  …
  case _ =>
})

The pattern matching is very similar to Java's switch statement.

Let's consider the following example of pattern matching using case classes that selects a mathematical formula according to the input:

package formulas {
  sealed abstract class A(val level: Int)
  case class AA() extends A(3) { def f = (x: Double) => 23*x }
  …
}

import formulas._
def compute(a: A, x: Double): Double = a match {
   case a: AA => a.f(x)
   …
}

The pattern matching is performed using the default equals method, whose bytecode is automatically generated for each case class. This approach is far more flexible than the simple enumeration, at the cost of extra computation cycles.

The advantages of using enumerations over case classes are as follows:

  • Enumerations involve less code for a single attribute comparison

  • Enumerations are more readable, especially for Java developers

The advantages of using case classes are as follows:

  • Case classes are data objects and support more attributes than enumeration IDs

  • Pattern matching is optimized for sealed classes as the Scala compiler is aware of the number of cases

In short, you should use enumeration for single value constants and case classes to match data objects.

Overloading

Contrary to C++, Scala does not support operator overloading in the strict sense; operators such as + and += are plain method names. Here is the definition of the very few operators used in the code snippets:

  • +=: This adds an element to a collection or container

  • +: This sums two elements of the same type
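
As a hedged illustration (the Acc class is hypothetical), these operators are ordinary methods:

import scala.collection.mutable.ArrayBuffer

// + is a plain method that sums two elements of the same type
class Acc(val sum: Double) {
  def + (that: Acc): Acc = new Acc(sum + that.sum)
}

val total = (new Acc(1.0) + new Acc(2.0)).sum // 3.0

// += adds an element to a mutable collection
val buffer = ArrayBuffer[Double]()
buffer += 1.0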

Design template for immutable classifiers

The machine learning algorithms described in this book use the following design pattern and components:

  • The set of configuration and tuning parameters for the classifier is defined in a class inheriting from Config (that is, SVMConfig).

  • The classifier implements a monadic data transformation of the ITransform type for which the model is implicitly generated from a training set (that is, SVM[T]). The classifier requires at least three parameters, which are as follows:

    • A configuration for the execution of the training and classification tasks

    • An input dataset, xt, of the Vector[T] type

    • A vector of labels or expected values

  • A model of a type inherited from Model. The constructor is responsible for creating the model through training (that is, SVMModel).

Let's take a look at the following diagram:

A generic UML class diagram for classifiers

For example, the key component of the support vector machine package is the SVM classifier:

final protected class SVM[T <: AnyVal](
    val config: SVMConfig, 
    val xt: XTSeries[Array[T]], 
    val labels: DblVector)(implicit f: T => Double)
  extends ITransform[Array[T]](xt) with Monitor[Double] {

  type V = Double
  val model: Option[SVMModel] = { … }
  override def |> : PartialFunction[Array[T], Try[V]]
  …
}

The training set is created by zipping the input dataset xt with the vector of labels or expected values, expected. Once trained and validated, the model is available for prediction or classification.

This design has the main advantage of simplifying the life cycle of a classifier: a model is either defined and available for classification, or it is not created at all.

The configuration and model classes are implemented as follows:

final class SVMConfig(val formulation: SVMFormulation, 
    val kernel: SVMKernel, 
    val svmExec: SVMExecution) extends Config

class SVMModel(val svmmodel: svm_model) extends Model
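
Under this design, a hedged usage sketch looks as follows (observation is a hypothetical data point); the partial function returned by |> is defined only if the model was successfully trained:

val svc = SVM[Double](config, xt, expected) // the model is trained in the constructor
val pfnSVC = svc |> // PartialFunction[Array[Double], Try[Double]]

val prediction: Option[Try[Double]] =
  if (pfnSVC.isDefinedAt(observation)) Some(pfnSVC(observation)) else None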

Note

Implementation considerations

The validation phase is omitted in most of the practical examples throughout the book for the sake of readability.

Utility classes

Data extraction

A CSV file is the most common format used to store historical financial data. It is the default format used to import data throughout the book. The data source relies on a DataSourceConfig configuration class, as follows:

case class DataSourceConfig(pathName: String, normalize: Boolean, 
     reverseOrder: Boolean, headerLines: Int = 1)

The parameters of the DataSourceConfig class are as follows:

  • pathName: This is the relative pathname of the data file to be loaded, or of the directory containing multiple input data files. Most of the files are CSV files.

  • normalize: This flag specifies whether the data has to be normalized over [0, 1].

  • reverseOrder: This flag specifies whether the order of the data in the file has to be reversed (for example, for a time series).

  • headerLines: This specifies the number of lines for the column headers and comments.
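
For instance, here is a hedged sketch of a configuration (the path is hypothetical) for a CSV file with a single header line, whose rows are normalized over [0, 1] and reversed into chronological order:

val srcConfig = DataSourceConfig(
  "resources/data/CSCO.csv", // hypothetical relative pathname
  normalize = true,
  reverseOrder = true) // headerLines defaults to 1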

The data source DataSource implements data transformation of the ETransform type using an explicit configuration DataSourceConfig, as described in the Monadic data transformation section in Chapter 2, Hello World!:

final class DataSource(config: DataSourceConfig,
    srcFilter: Option[Array[String] => Boolean] = None)
  extends ETransform[DataSourceConfig](config) {

  type Fields = Array[String]
  type U = List[Fields => Double]
  type V = XVSeries[Double]
  override def |> : PartialFunction[U, Try[V]] 
  ...
}

The srcFilter argument specifies a filter or condition on some of the row fields that is used to skip rows of the dataset (for example, rows with missing data or an incorrect format). Being an explicit data transformation, the constructor of the DataSource class has to initialize the input type U and the output type V of the extraction method |>. The method takes a list of extractors that convert a row of literal values into double floating-point values:

override def |> : PartialFunction[U, Try[V]] = {
  case fields: U if !fields.isEmpty => load.map(data => { //1
    val convert = (f: Fields => Double) => data._2.map(f(_))
    if (config.normalize) //2
      fields.map(t => new MinMax[Double](convert(t)) //3
          .normalize(0.0, 1.0).toArray).toVector //4
    else fields.map(convert(_)).toVector
  })
}

The data is loaded from the file using the load helper method (line 1). The data is normalized if required (line 2) by converting each literal to a floating point value using an instance of the MinMax class (line 3). Finally, the MinMax instance normalizes the sequence of floating point values (line 4).

The DataSource class implements a significant set of methods that are documented in the source code available online.

Data sources

The examples in the book rely on three different sources of financial data using the CSV format:

  • YahooFinancials: This is the Yahoo schema for historical stock and ETF prices

  • GoogleFinancials: This is the Google schema for historical stock and ETF prices

  • Fundamentals: This is for the fundamental financial analysis ratios (a CSV file)

Let's illustrate the extraction from a data source using YahooFinancials as an example:

object YahooFinancials extends Enumeration {
   type YahooFinancials = Value
   val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value

   val adjClose = (s: Array[String]) => s(ADJ_CLOSE.id).toDouble //5
   val volume = (s: Array[String]) => s(VOLUME.id).toDouble
   …
   def toDouble(value: Value): Array[String] => Double = 
       (s: Array[String]) => s(value.id).toDouble
}

Let's take a look at an example of an application of a DataSource transformation: loading the historical stock data from the Yahoo finance site. The data is downloaded as a CSV formatted file. Each column is associated with an extractor function (line 5):

val symbols = Array[String]("CSCO", ...)  //6
val prices = symbols
       .map(s => DataSource(s"$path$s.csv",true,true,1))//7
       .map( _ |> adjClose ) //8

The list of stocks for which the historical data has to be downloaded is defined as an array of symbols (line 6). Each symbol is associated with a CSV file (that is, CSCO => resources/CSCO.csv) (line 7). Finally, the YahooFinancials extractor for the adjClose price is invoked (line 8).

The format of the financial data extracted from the Google finance pages is similar to the format used in the Yahoo finance pages:

object GoogleFinancials extends Enumeration {
   type GoogleFinancials = Value
   val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME = Value

   val close = (s: Array[String]) => s(CLOSE.id).toDouble //5
   …
}

The YahooFinancials, GoogleFinancials, and Fundamentals classes implement a significant number of methods that are documented in the source code available online.

Extraction of documents

The DocumentsSource class is responsible for extracting the date, title, and content of a list of text documents or text files. The class does not support HTML documents. The DocumentsSource class implements a monadic data transformation of the ETransform type with an explicit configuration of the SimpleDateFormat type:

class DocumentsSource(dateFormat: SimpleDateFormat,
    val pathName: String) //1
  extends ETransform[SimpleDateFormat](dateFormat) {

 type U = Option[Long] //2
 type V = Corpus[Long]  //3

 override def |> : PartialFunction[U, Try[V]] = { //4
    case date: U if (filesList != None) => 
      Try( if(date == None ) getAll else get(date) )
 }
 def get(t: U): V = getAll.filter( _.date == t.get)
 def getAll: V  //5
 ...
}

The DocumentsSource class takes two arguments: the format of the date associated with the documents and the name of the path in which the documents are located (line 1). Being an explicit data transformation, the constructor of the DocumentsSource class has to initialize the input type U (line 2) as a date converted to a Long, and the output type V (line 3) as a Corpus, for the extraction method |>.

The |> extractor generates the corpus associated with a specific date, converted to a Long (line 4). The getAll method does the heavy lifting of extracting and sorting the documents (line 5).
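
A hedged usage sketch (the date format and directory are hypothetical); passing None selects the entire corpus:

val src = new DocumentsSource(
  new SimpleDateFormat("MM.dd.yyyy"), // hypothetical date format
  "resources/text/") // hypothetical document directory

val corpus: Try[Corpus[Long]] = (src |>)(None) // extract the entire corpus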

The implementation of the getAll method as well as other methods of the DocumentsSource class are described in the documented source code available online.

DMatrix class

Some discriminative learning models require operations to be performed on rows and columns of a matrix. The DMatrix class facilitates the read and write operations on columns and rows:

class DMatrix(val nRows: Int, val nCols: Int, 
     val data: DblArray) {
 def apply(i: Int, j: Int): Double = data(i*nCols+j)
 def row(iRow: Int): DblArray = { 
   val idx = iRow*nCols
   data.slice(idx, idx + nCols)
 }
 def col(iCol: Int): IndexedSeq[Double] =
   (iCol until data.size by nCols).map( data(_) )
 def diagonal: IndexedSeq[Double] = 
    (0 until data.size by nCols+1).map( data(_))
 def trace: Double = diagonal.sum
  …
}

The apply method returns an element of the matrix. The row method returns a row array, and the col method returns the indexed sequence of column elements. The diagonal method returns the indexed sequence of diagonal elements, and the trace method sums the diagonal elements.
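
Here is a small sketch of these accessors, assuming DblArray aliases Array[Double] and the elements are stored in row-major order:

val m = new DMatrix(2, 2, Array[Double](1.0, 2.0, 3.0, 4.0))

m(0, 1) // 2.0: element at row 0, column 1
m.row(1) // Array(3.0, 4.0)
m.col(0) // IndexedSeq(1.0, 3.0)
m.diagonal // IndexedSeq(1.0, 4.0)
m.trace // 5.0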

The DMatrix class supports normalization of elements, rows, and columns; transposition; and the update of elements, columns, and rows. The DMatrix class implements a significant number of methods that are documented in the source code available online.

Counter

The Counter class implements a generic mutable counter for which the key is a parameterized type. The number of occurrences of a key is managed by a mutable hash map:

class Counter[T] extends mutable.HashMap[T, Int] {
  def += (t: T): Counter[T] = { 
    super.put(t, getOrElse(t, 0)+1); this 
  }
  def + (t: T): Counter[T] = { 
    val newCnt = new Counter[T]
    newCnt ++= this // duplicate this counter
    newCnt += t // then update the copy
  }
  def ++ (cnt: Counter[T]): Counter[T] = { 
    cnt./:(this)((c, t) => c += t._1)
  }
  def / (cnt: Counter[T]): mutable.HashMap[T, Double] = map { 
    case(str, n) => (str, if( !cnt.contains(str) ) 
      throw new IllegalStateException(" ... ")
        else n.toDouble/cnt.get(str).get )
  }
  …
}

The += operator updates the count for the t key and returns itself. The + operator duplicates this counter and then updates the copy. The ++ operator updates this counter with another counter. The / operator divides the count for each key by the count of the same key in another counter.
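
A brief sketch of these operators on a Counter[String]:

val cnt = new Counter[String]
cnt += "a"
cnt += "a"
cnt += "b" // cnt: {a -> 2, b -> 1}

val other = new Counter[String]
other += "a"
other += "b" // other: {a -> 1, b -> 1}

val ratios = cnt / other // {a -> 2.0, b -> 1.0}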

The Counter class implements a significant set of methods that are documented in the source code available online.

Monitor

The Monitor trait has two purposes:

  • It stores log information and error messages using the show and error methods

  • It collects and displays variables related to the recursive or iterative execution of an algorithm

The data is collected at each iteration or recursion and then displayed as a time series, with the iteration count on the x axis:

trait Monitor[T] {
  protected val logger: Logger
  lazy val _counters = 
      new mutable.HashMap[String, mutable.ArrayBuffer[T]]()

  def counters(key: String): Option[mutable.ArrayBuffer[T]]
  def count(key: String, value: T): Unit 
  def display(key: String, legend: Legend)
      (implicit f: T => Double): Boolean
  def show(msg: String): Int = DisplayUtils.show(msg, logger)
  def error(msg: String): Int = DisplayUtils.error(msg, logger)
  ...
}

The counters method returns an array associated with a specific key. The count method updates the data associated with a key. The display method plots the time series. Finally, the show and error methods send information and error messages to the standard output.
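
Here is a hedged sketch of the Monitor trait mixed into a training loop (the Trainer class, the step function, and the logger initialization are hypothetical):

class Trainer(numEpochs: Int, step: Int => Double) extends Monitor[Double] {
  protected val logger = Logger.getLogger("Trainer") // assumes the Logger type expected by the trait

  def train(): Unit = {
    (0 until numEpochs).foreach(n => count("error", step(n))) // collect the error at each iteration
    show(s"Training completed after $numEpochs iterations")
  }
}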

The documented source code for the implementation of the Monitor class is available online.