Scala for Machine Learning - Second Edition

Overview of this book

The discovery of information through data clustering and classification is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, engineering design, logistics, manufacturing, and trading strategies to the detection of genetic anomalies. This book is your one-stop guide to the functional capabilities of the Scala programming language, such as dependency injection and implicits, that are critical to the implementation of machine learning algorithms. You start by learning data preprocessing and filtering techniques. You then move on to unsupervised learning techniques such as clustering and dimension reduction, followed by probabilistic graphical models such as Naïve Bayes, hidden Markov models, and Monte Carlo inference. The book then covers discriminative algorithms such as linear and logistic regression with regularization, kernelization, support vector machines, neural networks, and deep learning, before turning to evolutionary computing, multi-armed bandit algorithms, and reinforcement learning. Finally, it includes a comprehensive overview of parallel computing in Scala and Akka, followed by a description of Apache Spark and its ML library. With code updated for the latest version of Scala and comprehensive examples, this book will give you more than just a solid fundamental knowledge of machine learning with Scala.

Why Scala?


Like most functional languages, Scala provides developers and scientists with a toolbox to implement iterative computations that can be easily woven into a coherent dataflow. To some extent, Scala can be regarded as an extension of the popular map-reduce model for distributed computation of large amounts of data.

Note

Disclaimer

This section does not constitute a formal introduction or description of the features of Scala. It merely mentions some of its features that are valuable to machine learning practitioners. Experienced Scala developers may skip to the next section.

Among the capabilities of the language, the following features are deemed essential in machine learning and statistical analysis.

Scala as a functional language

There are many functional features in Scala which may unsettle software engineers with experience in object-oriented programming. This section deals specifically with monadic and functorial representations of data. Functors and monads are concepts defined in the branch of mathematics known as category theory. Formally:

  • A functor is a data type that defines how a transformation known as a map applies to it. Scala implements functors as type classes with a map method.

  • A monad is a wrapper around an existing data type. It applies a transformation to data of the wrapped type and returns a value of the same wrapper type. Scala implements monads as type classes with unit and flatMap methods. Monads extend functors in Scala.
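As a concrete illustration of both definitions, Scala's standard Option type behaves as a functor through map and as a monad through flatMap. A minimal sketch (the values are illustrative, not from the book):

```scala
// Option as a functor: map applies a plain function inside the wrapper.
val price: Option[Double] = Some(24.5)
val rounded: Option[Long] = price.map(x => math.round(x)) // Some(25L)

// Option as a monad: flatMap chains computations that each return an
// Option, without producing nested Option[Option[...]] wrappers.
def half(n: Int): Option[Int] = if (n % 2 == 0) Some(n / 2) else None
val result: Option[Int] = Some(8).flatMap(half).flatMap(half) // Some(2)
```

The second chain short-circuits to None as soon as any step fails, which is exactly the behavior monadic composition guarantees.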

Abstraction

Functors and monads are important concepts in functional programming.

Functors and monads are derived from category and group theory; they allow developers to create a high-level abstraction, as illustrated in the following Scala libraries:

  • Scalaz: https://github.com/scalaz

  • Twitter's Algebird: https://github.com/twitter/algebird

  • ScalaNLP's Breeze: https://github.com/dlwh/breeze

In mathematics, a category M is a structure that is defined by the following:

  • Objects of some type {x ∈ X, y ∈ Y, z ∈ Z, …}

  • Morphisms or maps applied to these objects: x ∈ X, y ∈ Y, f: x → y

  • Composition of morphisms: f: x → y, g: y → z => g o f: x → z
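Scala's function composition mirrors the composition of morphisms directly; a minimal sketch:

```scala
val f: Int => String = _.toString // f: x -> y
val g: String => Int = _.length   // g: y -> z
val gof: Int => Int = g compose f // g o f: x -> z

val out = gof(12345) // 5 (the number of digits)
```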

Covariant functors, contravariant functors, and bi-functors are well-understood concepts in algebraic topology that are related to manifolds and vector bundles. They are commonly used in differential geometry for the generation of non-linear models.

Higher kinded types

Higher kinded types (HKTs) are abstractions of types. They generate a new type from existing types. Let's consider the following parameterized trait:

trait M[T] { ... }

A higher kinded type H over a trait M is defined as follows:

       trait H[M[_]]; class H[M[_]] 

Functors and monads are higher kinded types.
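To make the abstraction concrete, here is a hypothetical higher kinded trait, Container, parameterized by the shape M[_] rather than by a concrete type, so the same abstraction works for List, Option, and so on (the names are illustrative, not from the book):

```scala
// A higher kinded trait: M is itself a type constructor, not a type.
trait Container[M[_]] {
  def first[T](m: M[T]): Option[T]
}

// One instance per concrete type constructor.
val listContainer: Container[List] = new Container[List] {
  def first[T](m: List[T]): Option[T] = m.headOption
}
val optionContainer: Container[Option] = new Container[Option] {
  def first[T](m: Option[T]): Option[T] = m
}

val a = listContainer.first(List(3, 5, 7)) // Some(3)
val b = optionContainer.first(Some("a"))   // Some("a")
```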

How are higher kinded types relevant to data analysis and machine learning?

Scientists define observations as sets or vectors of features.

Classification problems rely on the estimation of the similarity between vectors of observations. One technique consists of comparing two vectors by computing the normalized inner (or dot) product. A co-vector is defined as a linear map α from a vector to the inner product (field).

Note

Inner product

M1: Definition of the inner product ⟨., .⟩ and the co-vector α:

⟨., .⟩: V × V → T
α: V → T, with α(v) = ⟨v, w⟩ for a fixed w ∈ V

Let's define a vector as a constructor from any field, _ => Vector[_]. A co-vector is then defined as the mapping function of a vector to its field: Vector[_] => _.

Let's then define a two-dimension (two types or fields) higher kinded structure, Hom, that can be defined as either a vector or a co-vector by fixing one of the two types:

type Hom[T] = {
  type Right[X] = Function1[X,T] // Co-vector
  type Left[X] = Function1[T,X]   // Vector
 }
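For instance, fixing the field to Double yields concrete vector and co-vector types from the two projections. A sketch using the Hom definition (Scala 2 syntax; the names v and alpha are illustrative):

```scala
type Hom[T] = {
  type Right[X] = Function1[X, T] // Co-vector
  type Left[X] = Function1[T, X]  // Vector
}

// Fix the field T = Double
type CoVec[X] = Hom[Double]#Right[X] // X => Double
type Vec[X]   = Hom[Double]#Left[X]  // Double => X

// A vector: constructs a feature array from the field
val v: Vec[Array[Double]] = (x: Double) => Array(x, 2 * x)
// A co-vector: maps a feature array back to the field
val alpha: CoVec[Array[Double]] = (xs: Array[Double]) => xs.sum

val out = alpha(v(3.0)) // 3.0 + 6.0 = 9.0
```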

Note

Tensors and manifolds

Vectors and co-vectors are classes of tensors (contravariant and covariant, respectively). Tensor fields are used in non-linear models for manifold learning and in the generation of kernel functions. Manifolds are briefly introduced in the Manifolds section. The topic of tensor fields in manifold learning is beyond the scope of this book.

The projections of the higher kinded type Hom onto the Right or Left single-parameter types are known as functors:

  • Covariant functor for the right projection

  • Contravariant functor for the left projection

Functors

A covariant functor is a mapping function, such as F: C => C, with the following properties:

  • If f: x → y is a morphism on C then F(x) → F(y) is also a morphism on C

  • If id: x → x is the identity morphism on C then F(id) is also an identity morphism on C

  • If g: y → z is also a morphism on C then F(g o f) = F(g) o F(f)

The definition of the covariant functor is F[U => V] := F[U] => F[V]. Its implementation in Scala is:

trait Functor[M[_]]{ def map[U,V](m: M[U])(f: U=>V): M[V] }

For example, let's consider an observation defined as an n-dimension vector of type T, Obs[T]. The constructor for the observation can be represented as Function1[T,Obs]. Its functor, ObsFunctor, is implemented as:

trait ObsFunctor[T] extends Functor[(Hom[T])#Left] { self =>
  override def map[U,V](vu: Function1[T,U])(f: U =>V): 
    Function1[T,V] = f.compose(vu)
}

The functor is qualified as a covariant functor because the morphism is applied to the return type of the constructor of Obs, Function1[T, Obs]. The projection of the two-parameter type Hom onto a vector is implemented as (Hom[T])#Left.
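A quick usage sketch: instantiating ObsFunctor for T = Double and mapping a morphism over the return type of an observation constructor (Scala 2 syntax; obs and size are illustrative names):

```scala
trait Functor[M[_]] { def map[U, V](m: M[U])(f: U => V): M[V] }

type Hom[T] = {
  type Right[X] = Function1[X, T]
  type Left[X]  = Function1[T, X]
}

trait ObsFunctor[T] extends Functor[(Hom[T])#Left] { self =>
  override def map[U, V](vu: Function1[T, U])(f: U => V): Function1[T, V] =
    f.compose(vu)
}

val fct = new ObsFunctor[Double] {}
// Constructor of an observation: a Double produces a vector of features
val obs: Double => Array[Double] = x => Array(x, x * x)
// Map the morphism "vector size" over the constructor's return type
val size: Double => Int = fct.map(obs)(_.length)

val out = size(2.0) // 2 features
```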

A contravariant functor is a mapping function, F: C => C, with the following properties:

  • If f: x → y is a morphism on C then F(y) → F(x) is also a morphism on C

  • If id: x → x is the identity morphism on C then F(id) is also an identity morphism on C

  • If g: y → z is also a morphism on C then F(g o f) = F(f) o F(g)

The definition of the contravariant functor is F[U => V] := F[V] => F[U], as follows:

trait CoFunctor[M[_]]{ def map[U,V](m: M[U])(f: V=>U): M[V]}

Note that the input and output types in the morphism f are reversed from the definition of a covariant functor. The constructor for the co-vector can be represented as Function1[Obs,T]. Its functor, CoObsFunctor, is implemented as:

trait CoObsFunctor[T] extends CoFunctor[(Hom[T])#Right] {
  self =>
      override def map[U,V](vu: Function1[U,T])(f: V =>U): 
       Function1[V,T] = f.andThen(vu)
}
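A usage sketch for the contravariant case: pre-composing a co-vector with a morphism on its input type (Scala 2 syntax; alpha and the List-to-Array conversion are illustrative):

```scala
trait CoFunctor[M[_]] { def map[U, V](m: M[U])(f: V => U): M[V] }

type Hom[T] = {
  type Right[X] = Function1[X, T]
  type Left[X]  = Function1[T, X]
}

trait CoObsFunctor[T] extends CoFunctor[(Hom[T])#Right] { self =>
  override def map[U, V](vu: Function1[U, T])(f: V => U): Function1[V, T] =
    f.andThen(vu)
}

val coFct = new CoObsFunctor[Double] {}
// A co-vector: maps a vector of features to its field
val alpha: Array[Double] => Double = _.sum
// Pre-compose with a morphism f: List[Double] => Array[Double]
val alphaOnList: List[Double] => Double =
  coFct.map(alpha)((l: List[Double]) => l.toArray)

val out = alphaOnList(List(1.0, 2.0, 3.0)) // 6.0
```

Note that the morphism is applied to the input of the co-vector, which is why the arrows reverse.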

Monads

Monads are structures in algebraic topology related to category theory. Monads extend the concept of functors to allow the composition, known as monadic composition, of morphisms on a single type. They enable the chaining or weaving of computations into a sequence of steps, sometimes known as a data pipeline. The collections bundled with the Scala standard library (List, Map, and so on) are constructed as monads [1:1].

Monads provide the ability for those collections to do the following:

  • Create the collection

  • Transform the elements of the collection

  • Flatten nested collections

The following Scala definition of a monad as a trait illustrates the concept of a higher kinded Monad trait for type M:

trait Monad[M[_]] {
  def unit[T](a: T): M[T]
  def map[U,V](m: M[U])(f: U => V): M[V]
  def flatMap[U,V](m: M[U])(f: U => M[V]): M[V]
}
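A minimal sketch of an instance of this trait for the standard Option type, chaining two partial computations (the instance and function names are illustrative):

```scala
trait Monad[M[_]] {
  def unit[T](a: T): M[T]
  def map[U, V](m: M[U])(f: U => V): M[V]
  def flatMap[U, V](m: M[U])(f: U => M[V]): M[V]
}

val optionMonad: Monad[Option] = new Monad[Option] {
  def unit[T](a: T): Option[T] = Some(a)
  def map[U, V](m: Option[U])(f: U => V): Option[V] = m.map(f)
  def flatMap[U, V](m: Option[U])(f: U => Option[V]): Option[V] = m.flatMap(f)
}

// Two partial computations, each returning a value of the wrapper type
def sqrt(x: Double): Option[Double] = if (x >= 0) Some(math.sqrt(x)) else None
def inverse(x: Double): Option[Double] = if (x != 0) Some(1.0 / x) else None

val s = optionMonad.flatMap(optionMonad.unit(4.0))(sqrt) // Some(2.0)
val n = optionMonad.flatMap(Some(0.0))(inverse)          // None
```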

Monads are therefore critical in machine learning as they enable the composition of multiple data transformation functions into a sequence or workflow. This property is applicable to any type of complex scientific computation [1:2].

Note

Monadic composition of kernel functions

Monads are used in the composition of kernel functions in the Kernel functions monadic composition section in Chapter 12, Kernel Models and Support Vector Machines.

Scala as an object oriented language

Machine learning models are generated through sequences of tasks or dataflows that demand a modular design.

As an object-oriented programming language, Scala allows developers to do the following:

  • Define high-level component abstraction

  • Allow different developers to work concurrently on different components

  • Reuse code

  • Isolate functionality for easier debugging and testing (unit tests)

You may wonder how Scala fares as an object-oriented programming language against Java.

Note

Scala versus Java

Scala is a purer object-oriented language than Java: it supports neither static methods (their role is played by the methods of singleton objects) nor primitive types.

One important facet of object oriented programming is the ability to change modules or implement functionality on the fly, without the need to recompile the client code. This technique is known as dependency injection. Scala supports dependency injection using a combination of abstract variables, self-referenced composition, and stackable traits [1:3]. One of the most commonly used dependency injection patterns, the cake pattern, is described in the Building workflows with mixins section in Chapter 2, Data Pipelines.
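A minimal sketch of dependency injection with a self-type and trait composition, a simplified flavor of the cake pattern described in Chapter 2 (the Sampler and Normalizer components are illustrative, not from the book):

```scala
// Abstract component: declares a capability without committing to an implementation.
trait Sampler { def sample(xs: List[Double]): List[Double] }

// Self-referenced composition: Normalizer requires a Sampler to be mixed in,
// but does not depend on any particular implementation of it.
trait Normalizer { self: Sampler =>
  def normalize(xs: List[Double]): List[Double] = {
    val s = sample(xs)
    val max = s.max
    s.map(_ / max)
  }
}

// Inject a concrete Sampler at instantiation time, with no change to Normalizer.
val workflow = new Normalizer with Sampler {
  def sample(xs: List[Double]): List[Double] = xs.take(2)
}

val normalized = workflow.normalize(List(2.0, 4.0, 8.0)) // List(0.5, 1.0)
```

Swapping in a different Sampler (random sampling, stratified sampling, and so on) requires only a different mixin at the instantiation site.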

Scala as a scalable language

As seen previously, functors and monads enable the parallelization and chaining of data processing functions by leveraging Scala's higher-order methods. In terms of implementation, actors are one of the core elements that make Scala scalable. Actors provide Scala developers with a high level of abstraction to build scalable, distributed, and concurrent applications. Actors hide the nitty-gritty implementation of concurrency and the management of the underlying thread pool. Actors communicate through asynchronous, immutable messages. A distributed computing Scala framework such as Akka or Apache Spark extends the capabilities of the Scala standard library to support computation on very large datasets. Akka and Apache Spark are described in detail in the last chapter of this book [1:4].

Concisely, a workflow is implemented as a sequence of activities or computational tasks. These tasks consist of higher-order Scala methods such as flatMap, map, fold, reduce, collect, join, or filter applied to a large collection of observations. Scala provides developers with the tools to partition datasets and execute the tasks through a cluster of actors. Scala also supports message dispatching and routing between local and remote actors. A developer may decide to deploy a workflow either locally or across multiple CPU cores and servers with very few code alterations.

The following figure visualizes the different elements of the definition and deployment of a workflow (or data pipeline):

Deployment of a workflow for model training as a distributed computation

In the preceding diagram, a controller, that is, the master node, manages the sequence of tasks 1 to 4 in a similar way to a scheduler. These tasks are actually executed over multiple worker nodes, and are implemented by actors. The master node or actor exchanges messages with the workers to manage the state of the execution of the workflow, as well as its reliability, as illustrated in the Scalability with actors section of Chapter 16, Parallelism with Scala and Akka. The high availability of these tasks is maintained through a hierarchy of supervising actors.
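Actors themselves are covered in Chapter 16; as a minimal stand-in using only the standard library, the partition-execute-aggregate idea behind such a workflow can be sketched with futures (the dataset and the per-partition task are illustrative):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val observations: Vector[Double] = Vector.tabulate(1000)(_.toDouble)

// Partition the dataset, execute the task on each partition concurrently,
// then aggregate the partial results (here the task is a simple sum).
val partitions = observations.grouped(250).toVector
val partials: Vector[Future[Double]] = partitions.map(p => Future(p.sum))
val total: Double =
  Await.result(Future.sequence(partials).map(_.sum), 10.seconds)
// total == 0 + 1 + ... + 999 == 499500.0
```

Actors add to this picture supervision, routing, and location transparency, which plain futures do not provide.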

Note

Domain-specific languages (DSLs)

Scala embeds DSLs natively. DSLs are syntactic layers built on top of Scala's native libraries that allow software developers to abstract computation in terms that are easily understood by scientists. A well-known application of DSLs is the emulation of the MATLAB syntax familiar to most data scientists.