Here is a partial list of coding practices and design techniques used throughout the book.
The precompiled Scala for Machine Learning library is ScalaMl-2.11-0.99.2.jar, located in the directory $ROOT/project/target/scala-2.11.
Here is the complete list of recommended tools and libraries used to create and run the examples in Scala for Machine Learning:
Java JDK 1.7 or 1.8 for all chapters
Scala 2.11.8 or higher for all chapters
Scala IDE for Eclipse 4.0 or higher
IntelliJ IDEA Scala plug in 13.0 or higher
Sbt 0.13.1 or higher
Apache Commons Math 3.5 or higher for Chapter 3, Data Pre-processing, Chapter 4, Unsupervised Learning, and Chapter 12, Kernel Models and SVM
JFreeChart 1.0.1 in Chapter 1, Getting Started, Chapter 2, Data Pipelines, Chapter 5, Dimension Reduction, and Chapter 9, Regression and Regularization
Iitb CRF 0.2 (including LBFGS and Colt libraries) in Chapter 7, Sequential Data Models
LIBSVM 0.1.6 in Chapter 8, Monte Carlo Inference
Akka framework 2.3 or higher in Chapter 16, Parallelism in Scala and Akka
Apache Spark/MLlib 2.0 or higher in Chapter 17, Apache Spark MLlib
Apache Maven 3.5 or higher (required for Apache Spark 2.0 or higher)
Note for Spark developers
The assembly JAR for Apache Spark bundles a version of the Scala standard library and compiler JAR that may conflict with an existing Scala library (that is, the Eclipse default ScalaIDE library).
The lib directory contains the JAR files for the third-party libraries or frameworks used in the book: Colt, CRF, LBFGS, and LIBSVM.
For the sake of readability of the implementation of algorithms, all nonessential code such as error checking, comments, exceptions, or imports is omitted. The following code elements are discarded in the code snippets presented in the book:
The following are the comments:
/** This class is defined as … */ // MathRuntime exception has to be caught here!
The validation of class parameters and method arguments is as follows:
class Columns(cols: List[Int], …) {
require(cols.size > 0, "Cols is empty")
The code for class qualifiers such as final and private is as follows:
final protected class MLP[T: ToDouble] …
The code for method qualifiers and access control (final and private) is as follows:
final def inputLayer: MLPLayer
private def recurse: Unit
private[this] val eps = 1e-6
The code for serialization is as follows:
class Config extends Serializable {…}
The code for validation of partial functions is as follows:
val pfn: PartialFunction[U, V]
pfn.isDefinedAt(u)
The validation of intermediate state is as follows:
assert( p != None, " … ")
The following are Java-style exceptions:
try { … } catch { case e: ArrayIndexOutOfBoundsException => … }
if (y < EPS) throw new IllegalStateException( … )
The following are Scala-style exceptions:
Try(process(args)) match {
  case Success(results) => …
  case Failure(e) => …
}
The following are the nonessential annotations:
@inline def mean = …
@implicitNotFound("Conversion $T to Array[Int] undefined")
@throws(classOf[IllegalStateException])
The following is the logging and debugging code:
m_logger.debug( … )
Console.println( … )
Auxiliary and nonessential methods
Let's walk through these practices in detail.
One important objective in creating an API is to reduce the access to the supporting or helper class. There are two options to encapsulate helper classes:
Package scope: The supporting classes are first-level classes with protected access
Class or object scope: The supporting classes are nested in the main class
The algorithms presented in the book follow the first encapsulation pattern.
The constructors of a class are defined in the companion object using apply, and the class has package scope (protected):
protected class A[T](val x: X, val y: Y, …) { … }

object A {
  def apply[T](x: X, y: Y, ...): A[T] = new A(x, y, …)
  def apply[T](x: X, ...): A[T] = new A(x, y0, …)
}
For example, the SVM
class that implements the support vector machine is defined as follows:
final protected class SVM[T: ToDouble]( config: SVMConfig, xt: Vector[Array[T]], expected: DblVec) extends ITransform[Array[T], Array[Double]] {
The companion object, SVM
, is responsible for defining all the constructors (instance factories) relevant to the protected class SVM
:
def apply[T: ToDouble]( config: SVMConfig, xt: Vector[Array[T]], expected: DblVec): SVM[T] = new SVM[T](config, xt, expected)
In the preceding example, the constructors are explicitly defined in the companion object. Although the invocation of the constructor is very similar to the instantiation of case classes, there is a major difference: the Scala compiler generates several methods to manipulate an instance of a case class as regular data (equals, copy, and hashCode).
Case classes should be reserved for single state data objects (no methods).
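To make the distinction concrete, here is a small, hypothetical sketch (the Point and Sample names are made up for illustration) contrasting a case class, a pure data object with compiler-generated equals, copy, and hashCode, with a regular class whose construction is delegated to a companion-object factory:

```scala
// Case class: a single-state data object; the compiler generates
// equals, copy, and hashCode.
case class Point(x: Double, y: Double)

// Regular class with a companion-object factory (hypothetical example):
// construction goes through apply, and the constructor stays private.
class Sample private (val values: Array[Double])

object Sample {
  def apply(values: Array[Double]): Sample = new Sample(values)
}

val p = Point(1.0, 2.0)
assert(p == p.copy())            // structural equality from the case class
val s = Sample(Array(1.0, 2.0))  // factory invocation, no 'new' keyword
```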
It is quite common to read or hear discussions regarding the relative merit of enumerations and pattern matching with case classes in Scala. As a very general guideline, enumeration values can be regarded as lightweight case classes, or case classes can be considered heavyweight enumeration values.
Let's take an example of Scala enumeration, which consists of evaluating the uniform distribution of scala.util.Random
:
object A extends Enumeration {
type TA = Value
val A, B, C = Value
}
import A._
import scala.util.Random.nextInt
val counters = Array.fill(A.maxId+1)(0)
(0 until 1000).foreach( _ => nextInt(10) match {
case 3 => counters(A.id) += 1
…
case _ => …
})
The pattern matching is very similar to the Java switch statement.
Let's consider the following example of pattern matching using case classes that select a mathematical formula according to the input:
package AA {
  sealed abstract class A(val level: Int)
  case class AA() extends A(3) { def f = (x: Double) => 23*x }
  …
}
import AA._

def compute(a: A, x: Double): Double = a match {
  case a: AA => a.f(x)
  …
}
The pattern matching is performed using the default equals method, whose bytecode is automatically generated for each case class. This approach is far more flexible than a simple enumeration, at the cost of extra computation cycles.
The advantages of using enumerations over case classes are as follows:
Enumerations involve less code for a single attribute comparison
Enumerations are more readable, especially for Java developers
The advantages of using case classes are as follows:
Case classes are data objects and support more attributes than enumeration IDs
Pattern matching is optimized for sealed classes as the Scala compiler is aware of the number of cases
Briefly, you should use enumerations for single-value constants and case classes for matching data objects.
Contrary to C++, Scala does not actually overload operators. Here is the meaning of the few operators used in the code snippets:
+=: adds an element to a collection or container
+: sums two elements of the same type
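The two operators can be illustrated with a standard mutable container; this toy snippet is not taken from the book's code base:

```scala
import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer[Int](1, 2)
buf += 3                 // += adds an element to the container
val total = 2.5 + 1.5    // + sums two elements of the same type
```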
The machine learning algorithms described in Scala for Machine Learning use the following design pattern and components:
The set of configuration and tuning parameters for the classifier is defined in a class inheriting from Config (that is, SVMConfig).
The classifier implements a monadic data transformation of type ITransform for which the model is implicitly generated from a training set (that is, SVM[T]). The classifier requires at least three parameters, which are as follows:
A configuration for the execution of the training and classification tasks
An input dataset xt of type Vector[T]
A vector of labels or expected values
A model of type inherited from Model. The constructor is responsible for creating the model through training (that is, SVMModel).
For example, a key component of the support vector machine package is the classifier SVM:
final protected class SVM[T: ToDouble](
    config: SVMConfig,
    xt: Vector[Array[T]],
    val labels: DblVec)
  extends ITransform[Array[T], Array[Double]] {

  val model: Option[SVMModel] = …
  override def |> : PartialFunction[Array[T], Array[Double]]
  …
}
The training set is created by combining or zipping the input dataset xt
with the labels or expected values expected. Once trained and validated, the model is available for prediction or classification.
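The zipping step can be sketched with toy data (the values below are hypothetical, not from the book):

```scala
// Observations and their expected values (labels), zipped pairwise
// into a training set of (features, label) tuples.
val xt = Vector(Array(1.0, 2.0), Array(3.0, 4.0))
val expected = Vector(0.0, 1.0)
val trainingSet: Vector[(Array[Double], Double)] = xt.zip(expected)
```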
The design has the main advantage of reducing the lifecycle of a classifier: a model is either defined, available for classification, or is not created. The configuration and model classes are implemented as follows:
case class SVMConfig(
  formulation: SVMFormulation,
  kernel: SVMKernel,
  svmExec: SVMExecution) extends Config

class SVMModel(val svmmodel: svm_model) extends Model
Let's look at the utility classes in detail.
The CSV file is the most common format used to store historical financial data. It is the default format for importing data used throughout the book. The data source relies on the DataSourceConfig
configuration class as follows:
case class DataSourceConfig(
path: String,
normalize: Boolean,
reverseOrder: Boolean,
headerLines: Int = 1)
The parameters for the DataSourceConfig
class are as follows:
path: The relative pathname of a data file to be loaded if the argument is a file, or the directory containing multiple input data files. Most files are CSV files.
normalize: Flag to specify whether the data must be normalized over [0, 1]
reverseOrder: Flag to specify whether the order of the data in the file should be reversed (that is, time series) if true
headerLines: Number of lines for the column headers and comments
The data source, DataSource, implements a data transformation of type ETransform using an explicit configuration, DataSourceConfig, as described in the Monadic data transformation section of Chapter 2, Data Pipelines:
type Fields = Array[String]
type U = List[Fields => Double]
type V = Vector[Array[Double]]

final class DataSource(
    config: DataSourceConfig,
    srcFilter: Option[Fields => Boolean] = None)
  extends ETransform[U, V](config) {

  override def |> : PartialFunction[U, Try[V]]
  ...
}
The srcFilter argument specifies a filter or condition on some of the row fields, used to skip entries in the dataset (that is, missing data or incorrect format). Being an explicit data transformation, the constructor for the DataSource class has to initialize the input type U and output type V of the extracting method |>. The method takes an extractor from a row of literal values to double floating-point values:
override def |> : PartialFunction[U, Try[V]] = {
  case fields: U if !fields.isEmpty => load.map(data => { //1
    val convert = (f: Fields => Double) => data._2.map(f(_))
    if (config.normalize) //2
      fields.map(t => new MinMax[Double](convert(t)) //3
        .normalize(0.0, 1.0).toArray
      ).toVector //4
    else
      fields.map(convert(_)).toVector
  })
}
The data is loaded from the file using the helper method load (line 1). If normalization is required (line 2), each literal is converted to a floating-point value and wrapped in an instance of the MinMax class (line 3); the MinMax instance then normalizes the sequence of floating-point values (line 4).
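The normalization step can be summarized by a minimal stand-alone sketch of min-max scaling; this is a hypothetical stand-in for the MinMax class, not its actual implementation:

```scala
// Linearly rescales a sequence of values to the range [low, high].
def normalize(values: Seq[Double],
              low: Double = 0.0,
              high: Double = 1.0): Seq[Double] = {
  val (mn, mx) = (values.min, values.max)
  values.map(x => low + (x - mn) * (high - low) / (mx - mn))
}
```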
The DataSource
class implements a significant set of methods that are documented in the source code available online.
The examples in the book rely on three different sources of financial data using the CSV format as follows:
YahooFinancials: This is used for the Yahoo schema for historical stock and ETF prices
GoogleFinancials: This is used for the Google schema for historical stock and ETF prices
Fundamentals: This is used for fundamental financial analysis ratios (CSV file)
Let's illustrate the extraction from a data source using YahooFinancials as an example:
object YahooFinancials extends Enumeration {
  type YahooFinancials = Value
  val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value

  val adjClose = (s: Array[String]) => s(ADJ_CLOSE.id).toDouble //5
  val volume = (s: Fields) => s(VOLUME.id).toDouble
  …

  def toDouble(value: Value): Array[String] => Double =
    (s: Array[String]) => s(value.id).toDouble
}
Let's look at an example of application of the DataSource transformation: loading historical stock data from the Yahoo Finance site. The data is downloaded as a CSV-formatted file. Each column is associated with an extractor function (line 5):
val symbols = Array[String]("CSCO", ...) //6
val prices = symbols
.map(s => DataSource(s"$path$s.csv",true,true,1))//7
.map( _ |> adjClose ) //8
The list of stocks for which the historical data has to be downloaded is defined as an array of symbols (line 6). Each symbol is associated with a CSV file (that is, CSCO => resources/CSCO.csv) (line 7). Finally, the YahooFinancials extractor for the adjClose price is invoked (line 8).
The format of the financial data extracted from the Google financial pages is similar to the format used with the Yahoo financial pages:
object GoogleFinancials extends Enumeration {
  type GoogleFinancials = Value
  val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME = Value

  val close = (s: Array[String]) => s(CLOSE.id).toDouble //5
  …
}
The YahooFinancials
, GoogleFinancials
, and Fundamentals
classes implement a significant number of methods that are documented in the source code available online.
The DocumentsSource class is responsible for extracting the date, title, and content of a list of text documents or text files. The class does not support HTML documents. The DocumentsSource class implements a monadic data transformation of type ETransform with an explicit configuration of type SimpleDateFormat:
type U = Option[Long] //2
type V = Corpus[Long] //3

class DocumentsSource( //1
    dateFormat: SimpleDateFormat,
    val pathName: String)
  extends ETransform[U, V](dateFormat) {

  override def |> : PartialFunction[U, Try[V]] = { //4
    case date: U if filesList.isDefined =>
      Try( if (date.isEmpty) getAll else get(date) )
  }

  def get(t: U): V = getAll.filter(_.date == t.get)
  def getAll: V //5
  ...
}
The DocumentsSource class takes two arguments: the format of the date associated with the documents, and the name of the path where the documents are located (line 1). Being an explicit data transformation, the constructor for the DocumentsSource class has to initialize the input type U (line 2), a date converted to a Long, and the output type V (line 3), a Corpus, for the extracting method |>.
The extractor |> generates a corpus associated with a specific date converted to a Long type (line 4). The getAll method does the heavy lifting of extracting and sorting the documents (line 5).
The implementation of the getAll
method as well as other methods of the DocumentsSource
class is described in the documented source code available online.
Some of the discriminative learning models require operations performed on the rows and columns of a matrix. The DMatrix class facilitates read and write operations on columns and rows:
class DMatrix(
    val nRows: Int,
    val nCols: Int,
    val data: Array[Double]) {

  def apply(i: Int, j: Int): Double = data(i*nCols + j)

  def row(iRow: Int): Array[Double] = {
    val idx = iRow*nCols
    data.slice(idx, idx + nCols)
  }

  def col(iCol: Int): IndexedSeq[Double] =
    (iCol until data.size by nCols).map( data(_) )

  def diagonal: IndexedSeq[Double] =
    (0 until data.size by nCols+1).map( data(_) )

  def trace: Double = diagonal.sum
  …
}
The apply
method returns an element of the matrix. The row
method returns a row array and the col
method returns the indexed sequence of column elements. The diagonal
method returns the indexed sequence of diagonal elements and the trace
method sums the diagonal elements.
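A short usage sketch ties these methods together; the class body below restates the DMatrix methods just described, and the matrix values are hypothetical:

```scala
class DMatrix(val nRows: Int, val nCols: Int, val data: Array[Double]) {
  def apply(i: Int, j: Int): Double = data(i*nCols + j)
  def row(iRow: Int): Array[Double] = data.slice(iRow*nCols, iRow*nCols + nCols)
  def col(iCol: Int): IndexedSeq[Double] = (iCol until data.size by nCols).map(data(_))
  def diagonal: IndexedSeq[Double] = (0 until data.size by nCols + 1).map(data(_))
  def trace: Double = diagonal.sum
}

// A 2x2 matrix [[1, 2], [3, 4]] stored in row-major order.
val m = new DMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0))
assert(m(0, 1) == 2.0)                  // element access
assert(m.row(1).toSeq == Seq(3.0, 4.0)) // second row
assert(m.trace == 5.0)                  // 1.0 + 4.0
```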
The DMatrix class supports normalization of elements, rows, and columns, as well as transposition and updates of elements, columns, and rows. The DMatrix class implements a significant number of methods that are documented in the source code available online.
The Counter
class implements a generic mutable counter for which the key is a parameterized type. The number of occurrences of a key is managed by a mutable hash map as follows:
class Counter[T] extends HashMap[T, Int] {
  def += (t: T): Counter[T] = { super.put(t, getOrElse(t, 0)+1); this }

  def + (t: T): Counter[T] = { super.put(t, getOrElse(t, 0)+1); this }

  def ++ (cnt: Counter[T]): Counter[T] = {
    cnt./:(this)((c, t) => c + t._1); this
  }

  def / (cnt: Counter[T]): HashMap[T, Double] = map {
    case (str, n) => (str,
      if (!cnt.contains(str)) throw new IllegalStateException(" ... ")
      else n.toDouble/cnt.get(str).get)
  }
  …
}
The +=
operator updates the counter of key t
, and returns itself. The +
operator updates, and then duplicates the updated counters. The ++
operator updates this counter with another counter. The /
operator divides the count for each key by the counts of another counter.
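The counting behavior can be demonstrated with a simplified, hypothetical variant (WordCounter is a made-up name) that uses composition instead of inheriting from HashMap:

```scala
import scala.collection.mutable

// Simplified sketch of the Counter idea: occurrences per key,
// with += updating the count and returning the counter itself.
class WordCounter[T] {
  private val counts = mutable.HashMap[T, Int]()
  def += (t: T): WordCounter[T] = { counts.put(t, apply(t) + 1); this }
  def apply(t: T): Int = counts.getOrElse(t, 0)
}

val words = new WordCounter[String]
List("a", "b", "a").foreach(words += _)
assert(words("a") == 2 && words("b") == 1 && words("c") == 0)
```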
The Counter
class implements a significant set of methods that are documented in the source code available online.
The Monitor class has two purposes: collecting data during the iterative or recursive execution of an algorithm, and logging information and error messages. The data is collected at each iteration or recursion, and then displayed as a time series with iterations as the x axis values, as shown in the following code:
trait Monitor[T] {
  protected val logger: Logger
  lazy val _counters = HashMap[String, ArrayBuffer[T]]()

  def counters(key: String): Option[ArrayBuffer[T]]
  def count(key: String, value: T): Unit
  def display(key: String, legend: Legend)
      (implicit f: T => Double): Boolean
  def show(msg: String): Int = show(msg, logger)
  def error(msg: String): Int = error(msg, logger)
  ...
}
The counters method returns the array of values associated with a specific key. The count method updates the data associated with a key. The display method plots the time series. Finally, the show and error methods send information and error messages to the standard output.
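The counting side of Monitor can be sketched independently of its plotting and logging dependencies; the trait and object below are hypothetical stand-ins, not the book's implementation:

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}

// Collects a time series of values per key, one entry per iteration.
trait IterationMonitor[T] {
  lazy val _counters = HashMap[String, ArrayBuffer[T]]()

  def count(key: String, value: T): Unit =
    _counters.getOrElseUpdate(key, ArrayBuffer[T]()) += value

  def counters(key: String): Option[ArrayBuffer[T]] = _counters.get(key)
}

object Trace extends IterationMonitor[Double]
(0 until 3).foreach(i => Trace.count("err", 1.0 / (i + 1)))
assert(Trace.counters("err").get.size == 3)
```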
The documented source code for the implementation of the Monitor
class is available online.