Book Image

Machine Learning with Scala Quick Start Guide

By : Md. Rezaul Karim, Ajay Kumar N
Book Image

Machine Learning with Scala Quick Start Guide

By: Md. Rezaul Karim, Ajay Kumar N

Overview of this book

Scala is a highly scalable integration of object-oriented nature and functional programming concepts that make it easy to build scalable and complex big data applications. This book is a handy guide for machine learning developers and data scientists who want to develop and train effective machine learning models in Scala. The book starts with an introduction to machine learning, while covering deep learning and machine learning basics. It then explains how to use Scala-based ML libraries to solve classification and regression problems using linear regression, generalized linear regression, logistic regression, support vector machine, and Naïve Bayes algorithms. It also covers tree-based ensemble techniques for solving both classification and regression problems. Moving ahead, it covers unsupervised learning techniques, such as dimensionality reduction, clustering, and recommender systems. Finally, it provides a brief overview of deep learning using a real-life example in Scala.
Table of Contents (9 chapters)

ML libraries in Scala

Although Scala is a relatively new programming language compared to Java and Python, the question will arise as to why we need to consider learning it while we have Python and R. Well, Python and R are two leading programming languages for rapid prototyping and data analytics including building, exploring, and manipulating powerful models.

But Scala is becoming the key language too in the development of functional products, which are well suited for big data analytics. Big data applications often require stability, flexibility, high speed, scalability, and concurrency. All of these requirements can be fulfilled with Scala because Scala is not only a general-purpose language but also a powerful choice for data science (for example, Spark MLlib/ML). I've been using Scala for the last couple of years and I found that more and more Scala ML libraries are in development. Up next, we will discuss available and widely used Scala libraries that can be used for developing ML applications.

Interested readers can take a quick look at this, which lists the 15 most popular Scala libraries for ML and data science:
https://www.datasciencecentral.com/profiles/blogs/top-15-scala-libraries-for-data-science-in-2018-1

Spark MLlib and ML

MLlib is a library that provides user-friendly ML algorithms that are implemented using Scala. The same API is then exposed to provide support for other languages such as Java, Python, and R. Spark MLlib provides support for local vectors and matrix data types stored on a single machine, as well as distributed matrices backed by one or multiple resilient distributed datasets (RDDs).

RDD is the primary data abstraction of Apache Spark, often called Spark Core, that represents an immutable, partitioned collection of elements that can be operated on in parallel. The resiliency makes RDD fault-tolerant (based on RDD lineage graph). RDD can help in distributed computing even when data is stored on multiple nodes in a Spark cluster. Also, RDD can be converted into a dataset as a collection of partitioned data with primitive values such as tuples or other objects.

Spark ML is a new set of ML APIs that allows users to quickly assemble and configure practical machine learning pipelines on top of datasets, which makes it easier to combine multiple algorithms into a single pipeline. For example, an ML algorithm (called estimator) and a set of transformers (for example, a StringIndexer, a StandardScalar, and a VectorAssembler) can be chained together to perform the ML task as stages without needing to run them sequentially.

Interested readers can take a look at the Spark MLlib and ML guide at https://spark.apache.org/docs/latest/ml-guide.html.

At this point, I have to inform you of something very useful. Since we will be using Spark MLlib and ML APIs in upcoming chapters too. Therefore, it would be worth fixing some issues in advance. If you're a Windows user, then let me tell you about a very weird issue that you will experience while working with Spark. The thing is that Spark works on Windows, macOS, and Linux. While using Eclipse or IntelliJ IDEA to develop your Spark applications on Windows, you might face an I/O exception error and, consequently, your application might not compile successfully or may be interrupted.

Spark needs a runtime environment for Hadoop on Windows too. Unfortunately, the binary distribution of Spark (v2.4.0, for example) does not contain Windows-native components such as winutils.exe or hadoop.dll. However, these are required (not optional) to run Hadoop on Windows if you cannot ensure the runtime environment, an I/O exception saying the following will appear:

03/02/2019 11:11:10 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

There are two ways to tackle this issue on Windows and from IDEs such as Eclipse and IntelliJ IDEA:

  1. Download winutls.exe from https://github.com/steveloughran/ winutils/tree/ master/hadoop-2. 7. 1/bin/.
  2. Download and copy it inside the bin folder in the Spark distribution—for example, spark-2.2.0-bin-hadoop2.7/bin/.
  3. Select Project | Run Configurations... | Environment | New | and create a variable named HADOOP_HOME, then put the path in the Value field. Here is an example: c:/spark-2.2.0-bin-hadoop2.7/bin/ | OK | Apply | Run.

ScalNet and DynaML

ScalNet is a wrapper around Deeplearning4J intended to emulate a Keras-like API for developing deep learning applications. If you're already familiar with neural network architectures and are coming from a JVM background, it would be worth exploring the Scala-based ScalNet library:

DynaML is a Scala and JVM ML toolbox for research, education, and industry. This library provides an interactive, end-to-end, and enterprise-friendly way of developing ML applications. If you're interested, see more at https://transcendent-ai-labs.github.io/DynaML/.

ScalaNLP, Vegas, and Breeze

Breeze is one of the primary scientific computing libraries for Scala, which provides a fast and efficient way of data manipulation operations such as matrix and vector operations for creating, transposing, filling with numbers, conducting element-wise operations, and calculating determinants.

Breeze enables basic operations based on the netlib-java library, which enables extremely fast algebraic computations. In addition, Breeze provides a way to perform signal-processing operations, necessary for working with digital signals.

The following are the GitHub links:

On the other hand, ScalaNLP is a suite of scientific computing, ML, and natural language processing, which also acts as an umbrella project for several libraries, including Breeze and Epic. Vegas is another Scala library for data visualization, which allows plotting specifications such as filtering, transformations, and aggregations. Vegas is more functional than the other numerical processing library, Breeze.

For more information and examples of using Vegas and Breeze, refer to GitHub:

Whereas the visualization library of Breeze is backed by Breeze and JFreeChart, Vegas can be considered a missing Matplotlib for Scala and Spark, because it provides several options for rendering plots through and within interactive notebook environments, such as Jupyter and Zeppelin.

Refer to Zeppelin notebook solutions of each chapter in the GitHub repository of this book.