Scala for Machine Learning, Second Edition


Overview of this book

The discovery of information through data clustering and classification is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, engineering design, logistics, manufacturing, and trading strategies to the detection of genetic anomalies. This book is a one-stop guide that introduces you to the functional capabilities of the Scala programming language, such as dependency injection and implicits, that are critical to the creation of machine learning algorithms. You start by learning data preprocessing and filtering techniques. You'll then move on to unsupervised learning techniques such as clustering and dimension reduction, followed by probabilistic graphical models such as Naïve Bayes, hidden Markov models, and Monte Carlo inference. The book then covers discriminative algorithms such as linear and logistic regression with regularization, kernelization, support vector machines, neural networks, and deep learning. You'll move on to evolutionary computing, multiarmed bandit algorithms, and reinforcement learning. Finally, the book includes a comprehensive overview of parallel computing in Scala and Akka, followed by a description of Apache Spark and its ML library. With updated code based on the latest version of Scala and comprehensive examples, this book will ensure that you have more than just a solid fundamental knowledge of machine learning with Scala.



Taxonomy of machine learning algorithms

The purpose of machine learning is to teach computers to execute tasks without human intervention. An increasing number of applications, such as genomics, social networking, advertising, or risk analysis, generate very large amounts of data that can be analyzed or mined to extract knowledge or insight into a process, a customer, or an organization. Ultimately, machine learning algorithms consist of identifying and validating models to optimize a performance criterion using historical, present, and future data [1:5].

Data mining is the process of extracting or identifying patterns in a dataset.

Unsupervised learning

The goal of unsupervised learning is to discover patterns of regularities and irregularities in a set of observations. The process known as density estimation in statistics is broken down into two categories: the discovery of data clusters and the discovery of latent factors. The methodology consists of processing input data to understand patterns similar to the natural learning process in infants or animals.

Unsupervised learning does not require labeled data (or expected values), and therefore, is easy to implement and execute because no expertise is needed to validate an output. However, it is possible to label the output of a clustering algorithm and use it in future classifications.

Clustering

The purpose of data clustering is to partition a collection of data into a number of clusters or data segments. Practically, a clustering algorithm is used to organize observations into clusters by minimizing the distance between observations within a cluster and maximizing the distance between observations across clusters. A clustering algorithm consists of the following steps:

Creating a model by making an assumption about the input data
Selecting the objective function or goal of the clustering
Evaluating one or more algorithms to optimize the objective function

Data clustering is also known as data segmentation or data partitioning.
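The objective described above, small distances within a cluster, can be sketched in a few lines of Scala. This is a minimal illustration rather than the book's implementation: the names `ClusteringSketch`, `sqDist`, `nearest`, and `withinClusterCost` are hypothetical, and the squared Euclidean distance is an assumption about the distance metric.

```scala
object ClusteringSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension
  // (assumed metric; other distances can be substituted).
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to the observation.
  def nearest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => sqDist(p, centroids(i)))

  // Objective function: total squared distance between each observation
  // and its nearest centroid. A clustering algorithm tries to minimize it.
  def withinClusterCost(data: Vector[Point], centroids: Vector[Point]): Double =
    data.map(p => sqDist(p, centroids(nearest(p, centroids)))).sum
}
```

Comparing `withinClusterCost` for two candidate sets of centroids is one simple way to decide which partition fits the observations better.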

Dimension reduction

Dimension reduction techniques aim to find the smallest, yet most relevant, set of features needed to build a reliable model. There are many reasons for reducing the number of features or parameters in a model, from avoiding overfitting to reducing computation costs.
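As a rough illustration of discarding features that carry little information, the following sketch drops the columns whose variance falls below a threshold. Variance filtering is a deliberately simple stand-in chosen for this example, not one of the techniques the book develops; the names `FeatureFilterSketch`, `informativeFeatures`, and `reduce`, as well as the threshold value, are hypothetical.

```scala
object FeatureFilterSketch {
  // Population variance of one feature column.
  def variance(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  }

  // Indices of the features (columns) whose variance exceeds the threshold.
  // Assumes a non-empty dataset in which all rows have the same length.
  def informativeFeatures(data: Vector[Vector[Double]], minVariance: Double): Vector[Int] =
    data.head.indices.filter(j => variance(data.map(row => row(j))) > minVariance).toVector

  // Keeps only the retained features of every observation.
  def reduce(data: Vector[Vector[Double]], keep: Vector[Int]): Vector[Vector[Double]] =
    data.map(row => keep.map(j => row(j)))
}
```

A constant column has zero variance, so it is removed for any positive threshold, which reduces the number of model parameters without losing information.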

There are many ways to classify the different techniques used to extract knowledge from data using unsupervised learning. The taxonomy breaks down these techniques according to their purpose, although the list is far from being exhaustive, as shown in the following diagram:

Taxonomy of unsupervised learning algorithms

Supervised learning

The best analogy for supervised learning is function approximation or curve fitting. In its simplest form, supervised learning attempts to find a relation or function f: x → y using a training set {x, y}. Supervised learning is generally more accurate than other learning strategies as long as labeled input data is available and reliable. The downside is that a domain expert may be required to label (or tag) data as a training set.
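The curve-fitting analogy can be made concrete with a least-squares line fit. The sketch below assumes a one-dimensional input and a linear relation, which is only the simplest possible choice of f; the names `CurveFitSketch`, `fitLine`, and `predict` are hypothetical.

```scala
object CurveFitSketch {
  // Fits y ≈ slope * x + intercept to a training set of (x, y) pairs
  // by ordinary least squares. Requires at least two distinct x values.
  def fitLine(training: Seq[(Double, Double)]): (Double, Double) = {
    val n = training.size.toDouble
    val meanX = training.map(_._1).sum / n
    val meanY = training.map(_._2).sum / n
    val covXY = training.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = training.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    val slope = covXY / varX
    (slope, meanY - slope * meanX)
  }

  // The learned approximation of f: x → y, applied to a new observation.
  def predict(model: (Double, Double))(x: Double): Double =
    model._1 * x + model._2
}
```

For the training set `Seq((0.0, 1.0), (1.0, 3.0), (2.0, 5.0))`, `fitLine` returns a slope of 2 and an intercept of 1, so the model predicts 7 for x = 3.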

Supervised machine learning algorithms can be broken down into two categories:

Generative models
Discriminative models

Generative models

In order to simplify the description of a statistics formula, we adopt the following simplification: the probability of an event X is the same as the probability of the discrete random variable X having a value x: p(X) = p(X=x):

The notation for the joint probability is p(X,Y) = p(X=x, Y=y)
The notation for the conditional probability is p(X|Y) = p(X=x|Y=y)

Generative models attempt to fit a joint probability distribution p(X,Y) of two events (or random variables), X and Y, representing two sets of observed and hidden variables, x and y. Discriminative models compute the conditional probability p(Y|X) of an event or random variable Y of hidden variables y, given an event or random variable X of observed variables x. Generative models are commonly introduced through Bayes' rule. The conditional probability of an event Y given an event X is computed as the product of the conditional probability of the event X given the event Y and the probability of the event Y, normalized by the probability of the event X [1:6].

Note

Bayes' rule

Joint probability for independent random variables X=x and Y=y:

p(X,Y) = p(X=x) p(Y=y)

Conditional probability of a random variable Y = y, given X = x:

p(Y|X) = p(X=x, Y=y) / p(X=x)

Bayes' formula:

p(Y|X) = p(X|Y) p(Y) / p(X)

Bayes' rule is the foundation of the Naïve Bayes classifier, which is described in the Introducing the multinomial Naïve Bayes section in Chapter 6, Naïve Bayes Classifiers.
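A short numerical sketch may help to see Bayes' rule, p(Y|X) = p(X|Y) p(Y) / p(X), at work. It assumes a binary hidden variable Y, so that the evidence p(X) can be expanded with the law of total probability; the names `BayesSketch` and `posterior` and the probabilities used below are hypothetical.

```scala
object BayesSketch {
  // Posterior p(Y|X) computed from the likelihood p(X|Y), the prior p(Y),
  // and the likelihood of X under the complementary event, p(X|not Y).
  def posterior(likelihood: Double, prior: Double, likelihoodNotY: Double): Double = {
    // Evidence p(X) by the law of total probability for a binary Y.
    val evidence = likelihood * prior + likelihoodNotY * (1.0 - prior)
    likelihood * prior / evidence
  }
}
```

With p(X|Y) = 0.9, p(Y) = 0.01, and p(X|not Y) = 0.05, `BayesSketch.posterior(0.9, 0.01, 0.05)` evaluates to roughly 0.154: even a strong likelihood yields a modest posterior when the prior is small.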

Discriminative models

Contrary to generative models, discriminative models compute the conditional probability p(Y|X) directly, using the same algorithm for training and classification.
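To illustrate what computing p(Y|X) directly can look like, the sketch below evaluates a logistic model for a binary outcome. The logistic form is an assumption made for this example, and the names `DiscriminativeSketch`, `sigmoid`, and `conditional` are hypothetical; the weights would normally be learned from the training set.

```scala
object DiscriminativeSketch {
  // Logistic (sigmoid) function mapping a real-valued score to (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // p(Y=1 | X=x) for a logistic model with the given weights and bias.
  // No joint distribution p(X,Y) is ever estimated.
  def conditional(weights: Vector[Double], bias: Double)(x: Vector[Double]): Double =
    sigmoid(weights.zip(x).map { case (w, xi) => w * xi }.sum + bias)
}
```

The same `conditional` function serves both purposes: training adjusts `weights` and `bias` so that it matches the labels, and classification applies it to new observations.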

Generative and discriminative models have their respective advantages and drawbacks. Novice data scientists learn to match the appropriate algorithm to each problem through experimentation. Here are some brief guidelines describing which type of models make sense according to the objective or criteria of the project:

Objective	Generative models	Discriminative models
Accuracy	Highly dependent on the training set.	Depends on training set and algorithm configuration (for example, kernel functions).
Modeling requirements	There is a need to model both observed and hidden variables, which requires a significant amount of training.	The quality of the training set does not have to be as rigorous as for generative models.
Computation cost	It is usually low. For example, any graphical method derived from Bayes' rule has low overhead.	Most algorithms rely on optimization of a convex function with significant performance overhead.
Constraints	These models assume some degree of independence among the model features.	Most discriminative algorithms accommodate dependencies between features.

We can further refine the taxonomy of supervised learning algorithms by distinguishing, somewhat arbitrarily, between sequential and random variables for generative models and by breaking down discriminative methods as applied to continuous processes (regression) and discrete processes (classification). The following figure illustrates a partial taxonomy of supervised learning algorithms:

Taxonomy of supervised learning algorithms

Semi-supervised learning

Semi-supervised learning is used to build models from a dataset with incomplete labels. Manifold learning and information geometry algorithms are commonly applied to large datasets that are partially labeled. The description of semi-supervised learning techniques is beyond the scope of the book.

Reinforcement learning

Reinforcement learning is not as well understood as supervised and unsupervised learning outside the realm of robotics or game strategy. However, since the 1990s, genetic-algorithm-based classifiers have become increasingly popular in solving problems that require the collaboration of a system with a domain expert.

For some types of applications, reinforcement learning algorithms output a set of recommended actions for the adaptive system to execute. In their simplest form, these algorithms estimate the best course of action. Most complex systems based on reinforcement learning establish and update policies that can be vetoed by an expert, if necessary. The foremost challenge developers of reinforcement learning systems face is that the recommended action or policy may depend on a partially observable state.

Genetic algorithms are not usually considered part of the reinforcement learning toolbox. However, advanced models such as learning classifier systems use genetic algorithms to classify and reward the best-performing rules and policies.

As with the two previous learning strategies, reinforcement learning models can be categorized as Markovian or evolutionary. The following figure represents a partial taxonomy of the reinforcement learning algorithms:

Taxonomy of reinforcement learning algorithms

The genetic algorithm is described in Chapter 13, Evolutionary Computing, and the Q-learning reinforcement method is introduced in Chapter 15, Reinforcement Learning.
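As a preview of how a Markovian method such as Q-learning estimates the best course of action, the sketch below applies the standard update Q(s,a) ← Q(s,a) + α (r + γ max Q(s',a') − Q(s,a)) to a table of action values. It is a minimal illustration under simplifying assumptions (integer-encoded states and actions, unseen pairs valued at 0.0) and not the implementation presented in Chapter 15; the names `QLearningSketch`, `update`, and `bestAction` are hypothetical.

```scala
object QLearningSketch {
  // (state, action) -> estimated long-term value.
  type QTable = Map[(Int, Int), Double]

  // Best known value of a state over the available actions; 0.0 for unseen pairs.
  def maxValue(q: QTable, state: Int, actions: Seq[Int]): Double =
    actions.map(a => q.getOrElse((state, a), 0.0)).max

  // One Q-learning update after taking `action` in `state`, receiving `reward`,
  // and landing in `nextState`. alpha is the learning rate, gamma the discount.
  def update(q: QTable, state: Int, action: Int, reward: Double, nextState: Int,
             actions: Seq[Int], alpha: Double, gamma: Double): QTable = {
    val old = q.getOrElse((state, action), 0.0)
    val target = reward + gamma * maxValue(q, nextState, actions)
    q.updated((state, action), old + alpha * (target - old))
  }

  // Greedy recommendation: the action with the highest estimated value.
  def bestAction(q: QTable, state: Int, actions: Seq[Int]): Int =
    actions.maxBy(a => q.getOrElse((state, a), 0.0))
}
```

Starting from an empty table, a single rewarded transition is enough for `bestAction` to recommend the rewarded action in that state; repeated updates refine the estimates into a policy.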

This is a brief overview of machine learning algorithms with a suggested, approximate taxonomy. There are almost as many ways to introduce machine learning as there are data and computer scientists. We encourage you to browse the list of references at the end of the book to find the documentation appropriate to your level of interest and understanding.
