Scala Data Analysis Cookbook

By: Arun Manivannan

Overview of this book

This book will introduce you to the most popular Scala tools, libraries, and frameworks through practical recipes around loading, manipulating, and preparing your data. It will also help you explore and make sense of your data using stunning and insightful visualizations, and machine learning toolkits. Starting with introductory recipes on utilizing the Breeze and Spark libraries, get to grips with how to import data from a host of possible sources and how to pre-process numerical, string, and date data. Next, you'll get an understanding of concepts that will help you visualize data using the Apache Zeppelin and Bokeh bindings in Scala, enabling exploratory data analysis. Discover how to program quintessential machine learning algorithms using the Spark ML library. Work through steps to scale your machine learning models and deploy them into a standalone cluster, EC2, YARN, and Mesos. Finally, dip into the powerful options presented by Spark Streaming, and machine learning for streaming data, as well as utilizing Spark GraphX.

Working with vectors

There are subtle yet powerful differences between Breeze vectors and Scala's own scala.collection.Vector. As we'll see in this recipe, Breeze vectors offer many functions that are specific to linear algebra. More importantly, Breeze's vector is a Scala wrapper over netlib-java, and most calls to the vector's API are delegated to it.

Vectors are one of the core components in Breeze. They are containers of homogeneous data. In this recipe, we'll first see how to create vectors and then move on to various data manipulation functions to modify those vectors.

In this recipe, we will look at various operations on vectors. This recipe has been organized in the form of the following sub-recipes:

  • Creating vectors:

    • Creating a vector from values

    • Creating a zero vector

    • Creating a vector out of a function

    • Creating a vector of linearly spaced values

    • Creating a vector with values in a specific range

    • Creating an entire vector with a single value

    • Slicing a sub-vector from a bigger vector

    • Creating a Breeze vector from a Scala vector

  • Vector arithmetic:

    • Scalar operations

    • Calculating the dot product of two vectors

    • Creating a new vector by adding two vectors together

  • Appending vectors and converting a vector of one type to another:

    • Concatenating two vectors

    • Converting a vector of int to a vector of double

  • Computing basic statistics:

    • Mean and variance

    • Standard deviation

    • Find the largest value

    • Finding the sum, square root and log of all the values in the vector

Getting ready

To run the code, you could use either the Scala REPL or the Worksheet feature available in the Eclipse Scala plugin (Scala IDE) or in IntelliJ IDEA. These options are suggested because of their quick turnaround time.

How to do it...

Let's look at each of the above sub-recipes in detail. For easier reference, the output of the respective command is shown as well. All the classes that are being used in this recipe are from the breeze.linalg package. So, an "import breeze.linalg._" statement at the top of your file would be perfect.

Creating vectors

Let's look at the various ways we could construct vectors. Most of these construction mechanisms are through the apply method of the vector. There are two flavors of vector, breeze.linalg.DenseVector and breeze.linalg.SparseVector, and the choice between them depends on the use case. A general rule of thumb is that if at least 20 percent of your data is zeroes, you are better off choosing SparseVector, although that 20 percent threshold itself varies with the use case.

Constructing a vector from values

  • Creating a dense vector from values: Creating a DenseVector from values is just a matter of passing the values to the apply method:

      val dense=DenseVector(1,2,3,4,5)
      println (dense) //DenseVector(1, 2, 3, 4, 5)
  • Creating a sparse vector from values: Creating a SparseVector from values is also through passing the values to the apply method:

      val sparse=SparseVector(0.0, 1.0, 0.0, 2.0, 0.0)
      println (sparse) //SparseVector((0,0.0), (1,1.0), (2,0.0), (3,2.0), (4,0.0))

Notice how the SparseVector stores values against the index.

Obviously, there are simpler ways to create a vector instead of just throwing all the data into its apply method.

Creating a zero vector

Calling the vector's zeros function would create a zero vector. While the numeric types would return a 0, the object types would return null and the Boolean types would return false:

  val denseZeros=DenseVector.zeros[Double](5)  //DenseVector(0.0, 0.0, 0.0, 0.0, 0.0)

  val sparseZeros=SparseVector.zeros[Double](5)  //SparseVector()

Not surprisingly, the SparseVector does not allocate any memory for the contents of the vector. However, the creation of the SparseVector object itself is accounted for in the memory.
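The storage behavior described above can be observed directly. The following is a minimal sketch, assuming Breeze is on the classpath; it relies on SparseVector's activeSize (the number of entries actually stored), and the variable names are purely illustrative:

```scala
import breeze.linalg._

// A sparse zero vector has a logical length but stores no entries yet.
val sparse = SparseVector.zeros[Double](5)
val storedBefore = sparse.activeSize   // entries physically stored

// Assigning non-zero values stores only those positions.
sparse(1) = 1.0
sparse(3) = 2.0
val storedAfter = sparse.activeSize

// Positions that were never assigned still read as zero.
val untouched = sparse(0)
```

Here the logical length stays at 5 throughout, while the number of stored entries grows only as non-zero values are assigned.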

Creating a vector out of a function

The tabulate function in vector is an interesting and useful function. It accepts a size argument just like the zeros function but it also accepts a function that we could use to populate the values for the vector. The function could be anything ranging from a random number generator to a naïve index based generator, which we have implemented here. Notice how the return value of the function (Int) could be converted into a vector of Double by using the type parameter:

val denseTabulate=DenseVector.tabulate[Double](5)(index=>index*index) //DenseVector(0.0, 1.0, 4.0, 9.0, 16.0)
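As a further illustration of tabulate, here is a small sketch (assuming the same breeze.linalg import) in which the generator function already returns a Double; the names halves and expected are made up for this example:

```scala
import breeze.linalg._

// The function receives each index (0 until size) and returns the element.
val halves = DenseVector.tabulate[Double](5)(i => i * 0.5)

// The same vector built by hand, for comparison.
val expected = DenseVector(0.0, 0.5, 1.0, 1.5, 2.0)
```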

Creating a vector of linearly spaced values

The linspace function in breeze.linalg creates a new DenseVector[Double] of linearly spaced values between two arbitrary numbers. It accepts three arguments: the start, the end, and the total number of values that we would like to generate. Note that both the start and the end values are included in the generated vector:

val spaceVector=breeze.linalg.linspace(2, 10, 5)
//DenseVector(2.0, 4.0, 6.0, 8.0, 10.0)
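To make the spacing rule concrete, the sketch below compares linspace with a hand-rolled calculation. It assumes that element i equals start + i * step, with the interval divided into (count - 1) equal gaps; the names start, stop, count, and manual are illustrative:

```scala
import breeze.linalg._

val start = 0.0
val stop = 1.0
val count = 5

// Breeze's version.
val spaced = linspace(start, stop, count)

// Hand-rolled equivalent: element i is start + i * step,
// where step divides the interval into (count - 1) equal gaps.
val step = (stop - start) / (count - 1)
val manual = Array.tabulate(count)(i => start + i * step)
```

With these inputs the step is 0.25, so both versions should contain 0.0, 0.25, 0.5, 0.75, and 1.0.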

Creating a vector with values in a specific range

The range function in a vector has two variants. The plain vanilla function accepts a start and an end value (start inclusive, end exclusive):

val allNosTill10=DenseVector.range(0, 10)
//DenseVector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

The other variant is an overloaded function that accepts a "step" value:

val evenNosTill20=DenseVector.range(0, 20, 2)
// DenseVector(0, 2, 4, 6, 8, 10, 12, 14, 16, 18)

Just like the range function, which has all the arguments as integers, there is also a rangeD function that takes the start, stop, and the step parameters as Double:

val rangeD=DenseVector.rangeD(0.5, 20, 2.5)
// DenseVector(0.5, 3.0, 5.5, 8.0, 10.5, 13.0, 15.5)

Creating an entire vector with a single value

Filling an entire vector with the same value is child's play. We just say HOW BIG is this vector going to be and then WHAT value. That's it.

val denseJust2s=DenseVector.fill(10, 2)
// DenseVector(2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
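Besides the two-argument form shown above, fill also has a curried form whose second argument is passed by name. The sketch below assumes that this expression is evaluated once per element, in index order; the counter is only there to make that visible:

```scala
import breeze.linalg._

// Curried form: the block is evaluated once per element (by-name argument).
var counter = 0
val counted = DenseVector.fill(4) { counter += 1; counter }

// For the common case of all ones there is a dedicated constructor.
val ones = DenseVector.ones[Double](3)
```

The by-name form is handy when every element should come from a fresh computation, such as a random draw.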

Slicing a sub-vector from a bigger vector

Choosing a part of the vector from a previous vector is just a matter of calling the slice method on the bigger vector. The parameters to be passed are the start index, end index, and an optional "step" parameter. The step parameter adds the step value for every iteration until it reaches the end index. Note that the end index is excluded in the sub-vector:

val allNosTill10=DenseVector.range(0, 10)
//DenseVector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
val fourThroughSevenIndexVector= allNosTill10.slice(4, 7)
//DenseVector(4, 5, 6)
val twoThroughNineSkip2IndexVector= allNosTill10.slice(2, 9, 2)
//DenseVector(2, 4, 6)
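One point worth keeping in mind is that a slice of a DenseVector is a view over the same underlying data rather than a copy. The following sketch (illustrative names, step left at its default) shows the effect and how copy detaches the result:

```scala
import breeze.linalg._

val original = DenseVector(0, 1, 2, 3, 4, 5)

// slice(1, 4) covers indices 1, 2 and 3 (end index excluded).
val view = original.slice(1, 4)

// Writing through the slice also changes the original vector...
view(0) = 99
val seenInOriginal = original(1)

// ...whereas copy yields an independent vector.
val detached = original.slice(1, 4).copy
detached(0) = -1
val stillInOriginal = original(1)
```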

Creating a Breeze Vector from a Scala Vector

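A Breeze vector can be constructed from a Scala Vector by converting it to an array and passing that to the apply method. Passing the Scala Vector directly would instead be treated as a single value and produce a one-element vector that contains the Scala Vector itself:

val vectFromArray=DenseVector(collection.immutable.Vector(1,2,3,4).toArray)
// DenseVector(1, 2, 3, 4)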

Vector arithmetic

Now let's look at the basic arithmetic that we could do on vectors with scalars and vectors.

Scalar operations

Operations with scalars work just as we would expect, propagating the value to each element in the vector.

Adding a scalar to each element of the vector is done using the + function (surprise!):

val inPlaceValueAddition=evenNosTill20 +2
//DenseVector(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

Similarly, the other basic arithmetic operations (subtraction, multiplication, and division) involve calling the respective functions named after the universally accepted symbols (-, *, and /):

//Scalar subtraction
val inPlaceValueSubtraction=evenNosTill20 -2
//DenseVector(-2, 0, 2, 4, 6, 8, 10, 12, 14, 16)

 //Scalar multiplication
val inPlaceValueMultiplication=evenNosTill20 *2
//DenseVector(0, 4, 8, 12, 16, 20, 24, 28, 32, 36)

//Scalar division
val inPlaceValueDivision=evenNosTill20 /2
//DenseVector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
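Note that, despite the inPlace prefix in the variable names above, the plain operators return a new vector and leave the operand untouched. A minimal sketch of the difference between + and the compound += operator, using illustrative names:

```scala
import breeze.linalg._

val base = DenseVector(0, 2, 4)

// The plain operator allocates a new vector; base is untouched.
val shifted = base + 2
val baseAfterPlus = base.toArray.toList

// The compound operator updates the existing vector in place.
base += 2
val baseAfterPlusEquals = base.toArray.toList
```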

Calculating the dot product of two vectors

Each vector object has a function called dot, which accepts another vector of the same length as a parameter.

Let's fill in just 2s to a new vector of length 5:

val justFive2s=DenseVector.fill(5, 2)
 //DenseVector(2, 2, 2, 2, 2)

We'll create another vector from 0 to 5 with a step value of 1 (a fancy way of saying 0 through 4):

 val zeroThrough4=DenseVector.range(0, 5, 1)
 //DenseVector(0, 1, 2, 3, 4)

Here's the dot function:

val dotVector=zeroThrough4.dot(justFive2s)
 //Int = 20

As expected, the function complains if we pass in a vector of a different length as a parameter to the dot product: Breeze throws an IllegalArgumentException. The full exception message is:

java.lang.IllegalArgumentException: Vectors must be the same length!
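For reference, the dot product is simply the sum of the element-wise products. The sketch below checks Breeze's result against the same arithmetic written out in plain Scala; the names a, b, viaBreeze, and byHand are illustrative:

```scala
import breeze.linalg._

val a = DenseVector(0, 1, 2, 3, 4)
val b = DenseVector.fill(5, 2)

// Breeze's dot product.
val viaBreeze = a dot b

// The same arithmetic spelled out: 0*2 + 1*2 + 2*2 + 3*2 + 4*2.
val byHand = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
```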

Creating a new vector by adding two vectors together

The + function is overloaded to accept a vector as well as the scalar we saw previously. The operation performs an element-by-element addition and creates a new vector:

val evenNosTill20=DenseVector.range(0, 20, 2)
//DenseVector(0, 2, 4, 6, 8, 10, 12, 14, 16, 18)

val denseJust2s=DenseVector.fill(10, 2)
//DenseVector(2, 2, 2, 2, 2, 2, 2, 2, 2, 2)

val additionVector=evenNosTill20 + denseJust2s
// DenseVector(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

There's an interesting behavior encapsulated in the addition though. Assuming you try to add two vectors of different lengths, if the first vector is smaller and the second vector larger, the resulting vector would be the size of the first vector and the rest of the elements in the second vector would be ignored!

val fiveLength=DenseVector(1,2,3,4,5)
//DenseVector(1, 2, 3, 4, 5)
val tenLength=DenseVector.fill(10, 20)
//DenseVector(20, 20, 20, 20, 20, 20, 20, 20, 20, 20)

fiveLength+tenLength
//DenseVector(21, 22, 23, 24, 25)

On the other hand, if the first vector is larger and the second vector smaller, it would result in an ArrayIndexOutOfBoundsException:

tenLength+fiveLength
// java.lang.ArrayIndexOutOfBoundsException: 5
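Since mismatched lengths either lose data silently or fail at runtime, it may be worth guarding additions explicitly. The sketch below uses safeAdd, a hypothetical helper that is not part of Breeze, to add two vectors only when their lengths agree:

```scala
import breeze.linalg._

// Hypothetical helper (not part of Breeze): add only when lengths match.
def safeAdd(x: DenseVector[Int], y: DenseVector[Int]): Option[DenseVector[Int]] =
  if (x.length == y.length) Some(x + y) else None

val sameLength = safeAdd(DenseVector(1, 2, 3), DenseVector(10, 20, 30))
val mismatched = safeAdd(DenseVector(1, 2, 3), DenseVector.fill(10, 20))
```

Returning an Option is just one possible design; throwing early with require would serve equally well if a mismatch should be treated as a programming error.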

Appending vectors and converting a vector of one type to another

Let's briefly see how to append two vectors and convert vectors of one numeric type to another.

Concatenating two vectors

There are two variants of concatenation. There is a vertcat function that just vertically concatenates an arbitrary number of vectors—the size of the vector just increases to the sum of the sizes of all the vectors combined:

val justFive2s=DenseVector.fill(5, 2)
 //DenseVector(2, 2, 2, 2, 2)

 val zeroThrough4=DenseVector.range(0, 5, 1)
 //DenseVector(0, 1, 2, 3, 4)

val concatVector=DenseVector.vertcat(zeroThrough4, justFive2s)
//DenseVector(0, 1, 2, 3, 4, 2, 2, 2, 2, 2)

No surprise here. There is also the horzcat method that places the second vector horizontally next to the first vector, thus forming a matrix.

val concatVector1=DenseVector.horzcat(zeroThrough4, justFive2s)
//0  2
//1  2
//2  2
//3  2
//4  2

While dealing with vectors of different length, the vertcat function happily arranges the second vector at the bottom of the first vector. Not surprisingly, the horzcat function throws an exception:

java.lang.IllegalArgumentException, meaning all vectors must be of the same size!
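The shapes produced by the two functions can be checked with a short sketch. The vectors here are illustrative; rows and cols are the dimensions of the resulting DenseMatrix:

```scala
import breeze.linalg._

val two = DenseVector(1, 2)
val three = DenseVector(3, 4, 5)

// vertcat accepts different lengths; the result length is the sum.
val stacked = DenseVector.vertcat(two, three)

// horzcat needs equal lengths; each vector becomes one column of a matrix.
val sideBySide = DenseVector.horzcat(three, DenseVector(6, 7, 8))
```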

Converting a vector of Int to a vector of Double

The conversion of one type of vector into another is not automatic in Breeze. However, there is a simple way to achieve this:

val evenNosTill20Double=breeze.linalg.convert(evenNosTill20, Double)
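One practical reason to convert is that arithmetic on an Int vector stays in integer arithmetic. A small sketch of the difference, using illustrative names:

```scala
import breeze.linalg._

val ints = DenseVector(1, 2, 3)

// Int vectors use integer division, so fractions are truncated.
val intHalves = ints / 2

// After conversion the same operation keeps the fractional part.
val doubles = convert(ints, Double)
val doubleHalves = doubles / 2.0
```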

Computing basic statistics

Other than the creation and the arithmetic operations that we saw previously, there are some interesting summary statistics operations that are available in the library. Let's look at them now:

These operations need imports of breeze.linalg._ and breeze.numerics._, plus breeze.stats._ for the statistics functions. The operations in the Other operations section aim to mirror NumPy's UFuncs, or universal functions.

Now, let's briefly look at how to calculate some basic summary statistics for a vector.

Mean and variance

Calculating the mean and variance of a vector can be achieved by calling the meanAndVariance universal function in the breeze.stats package. Note that this needs a vector of Double:

meanAndVariance(evenNosTill20Double)
//mean = 9.0, variance ≈ 36.67, count = 10

As you may have guessed, converting an Int vector to a Double vector and calculating the mean and variance for that vector could be merged into a one-liner:

meanAndVariance(convert(evenNosTill20, Double))

Standard deviation

Calling stddev on a Double vector gives the standard deviation:

stddev(evenNosTill20Double)
//Double = 6.0553007081949835
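The figures above correspond to the sample variance, which divides by n - 1 rather than n. The sketch below cross-checks this by hand, using the individual mean, variance, and stddev functions from breeze.stats; the hand-rolled names are illustrative:

```scala
import breeze.linalg._
import breeze.stats._

val values = DenseVector(0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0)

// Breeze's results.
val breezeMean = mean(values)
val breezeVariance = variance(values)
val breezeStdDev = stddev(values)

// The same quantities by hand, using the sample variance (divide by n - 1).
val n = values.length
val handMean = values.toArray.sum / n
val handVariance = values.toArray.map(x => math.pow(x - handMean, 2)).sum / (n - 1)
val handStdDev = math.sqrt(handVariance)
```

For this data the squared deviations sum to 330, so the variance is 330 / 9, or roughly 36.67, and its square root is the standard deviation of roughly 6.055.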

Find the largest value in a vector

The max universal function inside the breeze.linalg package would help us find the maximum value in a vector:

val intMaxOfVectorVals=max (evenNosTill20)
//Int = 18

Finding the sum, square root and log of all the values in the vector

The same as with max, the sum universal function inside the breeze.linalg package calculates the sum of the vector:

val intSumOfVectorVals=sum (evenNosTill20)
//Int = 90
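The reductions in breeze.linalg follow the same calling pattern. As a short sketch with illustrative names, here are max and sum alongside min and argmax, which are also available in that package:

```scala
import breeze.linalg._

val evens = DenseVector.range(0, 20, 2)

val largest = max(evens)            // largest element
val smallest = min(evens)           // smallest element
val total = sum(evens)              // 0 + 2 + ... + 18
val indexOfLargest = argmax(evens)  // position of the largest element
```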

The functions sqrt, log, and various other universal functions in the breeze.numerics package calculate the square root and log values of all the individual elements inside the vector:

//The sqrt function
val sqrtOfVectorVals=sqrt(evenNosTill20)
// DenseVector(0.0, 1.4142135623730951, 2.0, 2.449489742783178, 2.8284271247461903, 3.1622776601683795, 3.4641016151377544, 3.7416573867739413, 4.0, 4.242640687119285)

//The log function
val log2VectorVals=log(evenNosTill20)
// DenseVector(-Infinity, 0.6931471805599453, 1.3862943611198906, 1.791759469228055, 2.0794415416798357, 2.302585092994046, 2.4849066497880004, 2.6390573296152584, 2.772588722239781, 2.8903717578961645)
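Because these are universal functions, the same name works on whole vectors and on plain scalars. A minimal sketch, assuming the breeze.numerics import and using illustrative names:

```scala
import breeze.linalg._
import breeze.numerics._

val squares = DenseVector(1.0, 4.0, 9.0, 16.0)

// Element-wise square root of a vector.
val roots = sqrt(squares)

// The same universal function also accepts a plain scalar.
val scalarRoot = sqrt(25.0)

// exp undoes log element by element (up to floating-point rounding).
val roundTrip = exp(log(squares))
```

Note that log here is the natural logarithm, which is why the first entry of the earlier output (the log of 0) is -Infinity.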