Scala Data Analysis Cookbook

By: Arun Manivannan

Overview of this book

This book will introduce you to the most popular Scala tools, libraries, and frameworks through practical recipes around loading, manipulating, and preparing your data. It will also help you explore and make sense of your data using stunning and insightful visualizations, and machine learning toolkits. Starting with introductory recipes on utilizing the Breeze and Spark libraries, get to grips with how to import data from a host of possible sources and how to pre-process numerical, string, and date data. Next, you’ll get an understanding of concepts that will help you visualize data using the Apache Zeppelin and Bokeh bindings in Scala, enabling exploratory data analysis. Discover how to program quintessential machine learning algorithms using the Spark ML library. Work through steps to scale your machine learning models and deploy them into a standalone cluster, EC2, YARN, and Mesos. Finally, dip into the powerful options presented by Spark Streaming, and machine learning for streaming data, as well as utilizing Spark GraphX.

Working with matrices


As we discussed in the Working with vectors recipe, you could use the Eclipse or IntelliJ IDEA Scala worksheets for a faster turnaround time.

How to do it...

Breeze matrices support a variety of operations. In this recipe, we will look at the following:

  • Creating matrices:

    • Creating a matrix from values

    • Creating a zero matrix

    • Creating a matrix out of a function

    • Creating an identity matrix

    • Creating a matrix from random numbers

    • Creating from a Scala collection

  • Matrix arithmetic:

    • Addition

    • Multiplication (also element-wise)

  • Appending and conversion:

    • Concatenating a matrix vertically

    • Concatenating a matrix horizontally

    • Converting a matrix of Int to a matrix of Double

  • Data manipulation operations:

    • Getting column vectors

    • Getting row vectors

    • Getting values inside the matrix

    • Getting the inverse and transpose of a matrix

  • Computing basic statistics:

    • Mean and variance

    • Standard deviation

    • Finding the largest value

    • Finding the sum, square root and log of all the values in the matrix

    • Calculating the eigenvectors and eigenvalues of a matrix

Creating matrices

Let's first see how to create a matrix.

Creating a matrix from values

The simplest way to create a matrix is to pass in the values in a row-wise fashion into the apply function of the matrix object:

val simpleMatrix=DenseMatrix((1,2,3),(11,12,13),(21,22,23))
//Returns a DenseMatrix[Int]
 1   2   3
11  12  13
21  22  23

There's also a sparse version of the matrix: the Compressed Sparse Column matrix (CSCMatrix):

val sparseMatrix=CSCMatrix((1,0,0),(11,0,0),(0,0,23))
//Returns a CSCMatrix[Int]
(0,0) 1
(1,0) 11
(2,2) 23
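
To make the name "Compressed Sparse Column" more concrete, here is a minimal sketch of that layout in plain Scala. It is an illustration only and not Breeze's implementation; the names CscSketch, Csc, and encode are hypothetical. The idea is that only the non-zero values are kept, column by column, together with the row index of each value and a pointer array marking where each column starts:

```scala
// Illustrative only: a hand-rolled CSC encoder, NOT Breeze's implementation.
object CscSketch {
  // data: the non-zero values, stored column by column
  // rowIndices: the row of each stored value
  // colPtrs: values of column j sit in data(colPtrs(j) until colPtrs(j + 1))
  final case class Csc(data: Vector[Int], rowIndices: Vector[Int], colPtrs: Vector[Int])

  def encode(rows: Vector[Vector[Int]]): Csc = {
    val nCols = rows.headOption.map(_.size).getOrElse(0)
    // Walk the matrix column by column, keeping only the non-zero entries
    // as (rowIndex, value) pairs.
    val entriesPerCol: Vector[Vector[(Int, Int)]] =
      Vector.tabulate(nCols) { c =>
        rows.zipWithIndex.collect { case (row, r) if row(c) != 0 => (r, row(c)) }
      }
    val flat = entriesPerCol.flatten
    // Running count of stored entries before each column begins.
    val colPtrs = entriesPerCol.scanLeft(0)(_ + _.size)
    Csc(flat.map(_._2), flat.map(_._1), colPtrs)
  }

  // The same values as sparseMatrix above.
  val example: Csc = encode(Vector(Vector(1, 0, 0), Vector(11, 0, 0), Vector(0, 0, 23)))
}
```

For the example matrix, only three values are stored; the empty middle column shows up as two equal neighbouring entries in colPtrs.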

Note

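Despite the (row, column) listing in the printed output, a CSCMatrix is not a Dictionary of Keys (DOK) structure. As its name says, it is stored in Compressed Sparse Column form; the printout simply lists each stored value against its (row, column) position.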

Creating a zero matrix

Creating a zero matrix is just a matter of calling the matrix's zeros function. The first integer parameter indicates the rows and the second parameter indicates the columns:

val denseZeros=DenseMatrix.zeros[Double](5,4)
//Returns a DenseMatrix[Double]
0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0

val compressedSparseMatrix=CSCMatrix.zeros[Double](5,4)
//Returns a CSCMatrix[Double] = 5 x 4 CSCMatrix

Note

Notice how the CSCMatrix doesn't allocate any memory for the values in the zero-value matrix.

Creating a matrix out of a function

The tabulate function in a matrix is very similar to the vector's version. It accepts the row and column sizes as its first two parameters (5 and 4 in the example). It also accepts a function that is used to populate the values of the matrix. In our example, we generate the values of the matrix by simply multiplying the row index by the column index:

val denseTabulate=DenseMatrix.tabulate[Double](5,4)((firstIdx,secondIdx)=>firstIdx*secondIdx)

//Returns a DenseMatrix[Double]
0.0  0.0  0.0  0.0
0.0  1.0  2.0  3.0
0.0  2.0  4.0  6.0
0.0  3.0  6.0  9.0
0.0  4.0  8.0  12.0

The type parameter is needed only if you would like to convert the type of the matrix from an Int to a Double. So, the following call without the parameter would just return an Int matrix:

val denseTabulate=DenseMatrix.tabulate(5,4)((firstIdx,secondIdx)=>firstIdx*secondIdx)

0  0  0  0
0  1  2  3
0  2  4  6
0  3  6  9
0  4  8  12

Creating an identity matrix

The eye function of the matrix would generate an identity square matrix with the given dimension (in the example's case, 3):

val identityMatrix=DenseMatrix.eye[Int](3)
//Returns a DenseMatrix[Int]
1  0  0
0  1  0
0  0  1

Creating a matrix from random numbers

The rand function generates a matrix of the given dimensions (4 rows * 4 columns in our case) with random values between 0 and 1. We'll take an in-depth look at randomly generated vectors and matrices in a subsequent recipe.

val randomMatrix=DenseMatrix.rand(4, 4)

//Returns DenseMatrix[Double]
0.09762565779429777   0.01089176285376725  0.2660579009292807   0.19428193961985674
0.9662568115400412    0.718377391997945    0.8230367668470933   0.3957540854393169
0.9080090988364429    0.7697780247035393   0.49887760321635066  0.26722019105654415
3.326843165250004E-4  0.447925644082819    0.8195838733418965   0.7682752255172411

Creating from a Scala collection

We could create a matrix out of a Scala array too. The constructor of the matrix accepts three arguments: the number of rows, the number of columns, and an array holding the values. Note that the values are read from the array in column-major order, that is, the first column is filled first, then the second, and so on:

val vectFromArray=new DenseMatrix(2,2,Array(2,3,4,5))
//Returns DenseMatrix[Int]
2  4
3  5
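
The column-major layout can be sketched in plain Scala. This is an illustration under stated assumptions, not Breeze's code: ColumnMajorSketch and toRows are hypothetical names, and the sketch only models the index arithmetic, in which element (r, c) of a matrix with the given number of rows is read from offset r + c * rows of the backing array:

```scala
// Illustrative only: models column-major indexing, NOT Breeze's implementation.
object ColumnMajorSketch {
  // Element (r, c) of a rows x cols matrix is read from offset r + c * rows.
  def toRows(rows: Int, cols: Int, data: Array[Int]): Vector[Vector[Int]] =
    Vector.tabulate(rows, cols)((r, c) => data(r + c * rows))
}
```

Calling ColumnMajorSketch.toRows(2, 2, Array(2, 3, 4, 5)) yields the rows (2, 4) and (3, 5), matching the output above. Because only offsets 0 to 3 are ever read, trailing values in a longer array are never touched, while an array that is too short fails with an ArrayIndexOutOfBoundsException.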

If there are more values than the number of values required by the dimensions of the matrix, the rest of the values are ignored. Note how (6,7) is ignored in the array:

val vectFromArray=new DenseMatrix(2,2,Array(2,3,4,5,6,7))
//DenseMatrix[Int]
2  4
3  5

However, if fewer values are present in the array than what is required by the dimensions of the matrix, then the constructor call would throw an ArrayIndexOutOfBoundsException:

val vectFromArrayIobe=new DenseMatrix(2,2,Array(2,3,4))

//throws java.lang.ArrayIndexOutOfBoundsException: 3

Matrix arithmetic

Now let's look at the basic arithmetic that we could do using matrices.

Let's consider a simple 3*3 simpleMatrix and a corresponding identity matrix:

val simpleMatrix=DenseMatrix((1,2,3),(11,12,13),(21,22,23))
//DenseMatrix[Int]
1   2   3
11  12  13
21  22  23

val identityMatrix=DenseMatrix.eye[Int](3)
//DenseMatrix[Int]
1  0  0
0  1  0
0  0  1

Addition

Adding two matrices will result in a matrix whose corresponding elements are summed up.

val additionMatrix=identityMatrix + simpleMatrix
// Returns DenseMatrix[Int]
2   2   3
11  13  13
21  22  24

Multiplication

Now, as you would expect, multiplying a matrix by the identity matrix gives you the matrix itself:

val simpleTimesIdentity=simpleMatrix * identityMatrix
//Returns DenseMatrix[Int]
1   2   3
11  12  13
21  22  23

Breeze also has element-by-element operations, written by prefixing the operator with a colon, for example, :+, :-, :*, and so on. Check out what happens when we do an element-wise multiplication of the identity matrix and the simple matrix:

val elementWiseMulti=identityMatrix :* simpleMatrix
//DenseMatrix[Int]
1  0   0
0  12  0
0  0   23
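
The difference between the two kinds of multiplication can be spelled out with a small sketch on plain Scala collections. This is only an illustration of the arithmetic, not how Breeze computes it; ProductSketch, matMul, and elementWise are hypothetical names:

```scala
// Illustrative only: the arithmetic behind the two products, NOT Breeze's implementation.
object ProductSketch {
  type Matrix = Vector[Vector[Int]]

  // Matrix product: entry (i, j) is the dot product of row i of a and column j of b.
  def matMul(a: Matrix, b: Matrix): Matrix = {
    val bCols = b.transpose
    a.map(row => bCols.map(col => row.zip(col).map { case (x, y) => x * y }.sum))
  }

  // Element-wise product: entry (i, j) is simply a(i)(j) * b(i)(j).
  def elementWise(a: Matrix, b: Matrix): Matrix =
    a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => x * y } }

  // The same values as simpleMatrix and identityMatrix above.
  val simple: Matrix = Vector(Vector(1, 2, 3), Vector(11, 12, 13), Vector(21, 22, 23))
  val identity: Matrix = Vector.tabulate(3, 3)((r, c) => if (r == c) 1 else 0)
}
```

With these definitions, matMul(simple, identity) gives back simple, whereas elementWise(identity, simple) keeps only the diagonal values 1, 12, and 23, since every off-diagonal value is multiplied by 0.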

Appending and conversion

Let's briefly see how to append two matrices and convert matrices of one numeric type to another.

Concatenating matrices – vertically

Similar to vectors, DenseMatrix has a vertcat function that vertically concatenates an arbitrary number of matrices. The row count of the result is the sum of the row counts of all the matrices combined:

val vertConcatMatrix=DenseMatrix.vertcat(identityMatrix, simpleMatrix)

//DenseMatrix[Int]
1   0   0
0   1   0
0   0   1
1   2   3
11  12  13
21  22  23

Attempting to concatenate matrices with different numbers of columns would, as expected, throw an IllegalArgumentException:

java.lang.IllegalArgumentException: requirement failed: Not all matrices have the same number of columns

Concatenating matrices – horizontally

Not surprisingly, the horzcat function concatenates the matrices horizontally. The column count of the result is the sum of the column counts of all the matrices:

val horzConcatMatrix=DenseMatrix.horzcat(identityMatrix, simpleMatrix)
// DenseMatrix[Int]
1  0  0  1   2   3
0  1  0  11  12  13
0  0  1  21  22  23

Similar to the vertical concatenation, attempting to concatenate matrices with different numbers of rows would throw an IllegalArgumentException:

java.lang.IllegalArgumentException: requirement failed: Not all matrices have the same number of rows

Converting a matrix of Int to a matrix of Double

The conversion of one type of matrix to another is not automatic in Breeze. However, there is a simple way to achieve this:

import breeze.linalg.convert
val simpleMatrixAsDouble=convert(simpleMatrix, Double)
// DenseMatrix[Double] =
1.0   2.0   3.0
11.0  12.0  13.0
21.0  22.0  23.0

Data manipulation operations

Let's create a simple 2*2 matrix that will be used for the rest of this section:

val simpleMatrix=DenseMatrix((4.0,7.0),(3.0,-5.0))
//DenseMatrix[Double] =
4.0  7.0
3.0  -5.0

Getting column vectors out of the matrix

The first column vector could be retrieved by passing in the column parameter as 0 and using :: in order to say that we are interested in all the rows.

val firstVector=simpleMatrix(::,0)
//DenseVector(4.0, 3.0)

Getting the second column vector and so on is achieved by passing the correct zero-indexed column number:

val secondVector=simpleMatrix(::,1)
//DenseVector(7.0, -5.0)

Alternatively, you could explicitly pass in the range of rows to be extracted:

val firstVectorByCols=simpleMatrix(0 to 1,0)
//DenseVector(4.0, 3.0)

While explicitly stating the range (as in 0 to 1), we have to be careful not to exceed the matrix size. For example, the following attempt to select 3 columns (0 through 2) on a 2 * 2 matrix would throw an ArrayIndexOutOfBoundsException:

val errorTryingToSelect3ColumnsOn2By2Matrix=simpleMatrix(0,0 to 2)
//java.lang.ArrayIndexOutOfBoundsException

Getting row vectors out of the matrix

If we would like to get the row vector, all we need to do is play with the row and column parameters again. As expected, it would give a transpose of the column vector, which is simply a row vector.

Like the column vector, we could either explicitly state our columns or pass in a wildcard (::) to cover the entire range of columns:

val firstRowStatingCols=simpleMatrix(0,0 to 1)
//Transpose(DenseVector(4.0, 7.0))

val firstRowAllCols=simpleMatrix(0,::)
//Transpose(DenseVector(4.0, 7.0))

Getting the second row vector is achieved by passing the second row (1) and all the columns (::) in that vector:

val secondRow=simpleMatrix(1,::)
//Transpose(DenseVector(3.0, -5.0))

Getting values inside the matrix

To get a single value from the matrix, pass in its exact row and column numbers. For example, to get the value at the first row and first column:

val firstRowFirstCol=simpleMatrix(0,0)
//Double = 4.0

Getting the inverse and transpose of a matrix

Getting the inverse and the transpose of a matrix is a little counter-intuitive in Breeze. Let's consider the same matrix that we dealt with earlier:

val simpleMatrix=DenseMatrix((4.0,7.0),(3.0,-5.0))

On the one hand, transpose is a function on the matrix object itself, like so:

val transpose=simpleMatrix.t
4.0  3.0
7.0  -5.0

The inverse, on the other hand, is computed by a universal function (inv) under the breeze.linalg package:

val inverse=inv(simpleMatrix)

0.12195121951219512  0.17073170731707318
0.07317073170731708  -0.0975609756097561

Let's multiply the matrix by its inverse and confirm that the result is an identity matrix:

simpleMatrix * inverse

1.0  0.0
-5.551115123125783E-17  1.0

As expected, the result is indeed an identity matrix with rounding errors when doing floating point arithmetic.
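
For a 2*2 matrix, the inverse has a well-known closed form, which gives a quick way to cross-check the output of inv by hand. The sketch below is plain Scala and is not how Breeze computes the inverse; Inverse2x2Sketch and inverse are hypothetical names:

```scala
// Illustrative only: closed-form 2x2 inverse, NOT Breeze's implementation.
object Inverse2x2Sketch {
  // For the matrix
  //   a  b
  //   c  d
  // the inverse is 1 / (a*d - b*c) times
  //   d  -b
  //  -c   a
  // It does not exist when the determinant a*d - b*c is zero.
  def inverse(a: Double, b: Double, c: Double, d: Double): Option[Vector[Vector[Double]]] = {
    val det = a * d - b * c
    if (det == 0.0) None // singular matrix: no inverse
    else Some(Vector(Vector(d / det, -b / det), Vector(-c / det, a / det)))
  }
}
```

For our matrix, the determinant is 4 * (-5) - 7 * 3 = -41, so the entries are 5/41, 7/41, 3/41, and -4/41, which agree with the values returned by inv above.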

Computing basic statistics

Now, just like vectors, let's briefly look at how to calculate some basic summary statistics for a matrix.

Tip

This needs imports of breeze.linalg._, breeze.numerics._, and breeze.stats._. The operations in the "Other operations" section aim to simulate NumPy's UFuncs, or universal functions.

Mean and variance

Calculating the mean and variance of a matrix could be achieved by calling the meanAndVariance universal function in the breeze.stats package. Note that this needs a matrix of Double:

meanAndVariance(simpleMatrixAsDouble)
// MeanAndVariance(12.0,75.75,9)

Alternatively, converting an Int matrix (such as the 3*3 simpleMatrix from the Matrix arithmetic section) to a Double matrix and calculating its mean and variance could be merged into a one-liner:

meanAndVariance(convert(simpleMatrix, Double))

Standard deviation

Calling stddev on a Double matrix gives the standard deviation:

stddev(simpleMatrixAsDouble)
//Double = 8.703447592764606
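
The reported numbers are consistent with the sample variance, which divides by n - 1 rather than n. The following plain-Scala sketch reproduces them for the nine values of the 3*3 matrix; StatsSketch and its members are hypothetical names, and the code is a hand calculation rather than Breeze's implementation:

```scala
// Illustrative only: hand-computed summary statistics, NOT Breeze's implementation.
object StatsSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Sample variance: sum of squared deviations divided by (n - 1).
  def sampleVariance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
  }

  def sampleStdDev(xs: Seq[Double]): Double = math.sqrt(sampleVariance(xs))

  // The nine values of the 3*3 matrix used above.
  val values: Seq[Double] = Seq(1, 2, 3, 11, 12, 13, 21, 22, 23).map(_.toDouble)
}
```

Here the squared deviations from the mean of 12.0 add up to 606, and 606 / 8 = 75.75; dividing by 9 instead would give roughly 67.33, which does not match the output above.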

Next up, let's look at some basic aggregation operations:

val simpleMatrix=DenseMatrix((1,2,3),(11,12,13),(21,22,23))

Finding the largest value in a matrix

The (apply method of the) max object (a universal function) inside the breeze.linalg package will help us do that:

val intMaxOfMatrixVals=max (simpleMatrix)
//23

Finding the sum, square root and log of all the values in the matrix

The same as with max, the sum object inside the breeze.linalg package calculates the sum of all the matrix elements:

val intSumOfMatrixVals=sum (simpleMatrix)
//108

The functions sqrt, log, and various other objects (universal functions) in the breeze.numerics package calculate the square root and log values of all the individual values inside the matrix.

Sqrt
val sqrtOfMatrixVals= sqrt (simpleMatrix)
//DenseMatrix[Double] =
1.0              1.4142135623730951  1.7320508075688772
3.3166247903554   3.4641016151377544  3.605551275463989
4.58257569495584  4.69041575982343    4.795831523312719

Log
val log2MatrixVals=log(simpleMatrix)
//DenseMatrix[Double] (log is the natural logarithm, despite the name of the val)
0.0                 0.6931471805599453  1.0986122886681098
2.3978952727983707  2.4849066497880004  2.5649493574615367
3.044522437723423   3.091042453358316   3.1354942159291497

Calculating the eigenvectors and eigenvalues of a matrix

Calculating eigenvectors is straightforward in Breeze. Let's consider our simpleMatrix from the previous section:

val simpleMatrix=DenseMatrix((4.0,7.0),(3.0,-5.0))

Calling the breeze.linalg.eig universal function on a matrix returns a breeze.linalg.eig.DenseEig object that encapsulates the eigenvectors and eigenvalues:

val denseEig=eig(simpleMatrix)

This line of code returns the following:

Eig(
DenseVector(5.922616289332565, -6.922616289332565),
DenseVector(0.0, 0.0),
0.9642892971721949   -0.5395744865143975
0.26485118719604456  0.8419378679586305)

We could extract the eigenvectors and eigenvalues by calling the corresponding functions on the returned Eig reference:

val eigenVectors=denseEig.eigenvectors
//DenseMatrix[Double] =
0.9642892971721949   -0.5395744865143975
0.26485118719604456  0.8419378679586305

The two eigenvalues corresponding to the two eigenvectors could be captured using the eigenvalues function on the Eig object:

val eigenValues=denseEig.eigenvalues
//DenseVector[Double] = DenseVector(5.922616289332565, -6.922616289332565)

Let's validate the eigenvalues and the vectors:

  1. Let's multiply the matrix with the first eigenvector:

    val matrixToEigVector=simpleMatrix*denseEig.eigenvectors (::,0)
    //DenseVector(5.7111154990610915, 1.568611955536362)

  2. Then let's multiply the first eigenvalue by the first eigenvector. The resulting vector is the same, up to a marginal floating-point error:

    val vectorToEigValue=denseEig.eigenvectors(::,0) * denseEig.eigenvalues (0)
    //DenseVector(5.7111154990610915, 1.5686119555363618)
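
The eigenvalues themselves can also be cross-checked by hand. As a short derivation (writing A for simpleMatrix and I for the 2*2 identity matrix, notation that is not used elsewhere in this recipe), the eigenvalues are the roots of the characteristic polynomial:

```latex
\det(A - \lambda I)
  = \det\begin{pmatrix} 4 - \lambda & 7 \\ 3 & -5 - \lambda \end{pmatrix}
  = (4 - \lambda)(-5 - \lambda) - 7 \cdot 3
  = \lambda^2 + \lambda - 41 = 0
\quad\Longrightarrow\quad
\lambda = \frac{-1 \pm \sqrt{165}}{2} \approx 5.9226,\ -6.9226
```

Both roots agree with the values in denseEig.eigenvalues, and their sum (-1) and product (-41) match the trace and determinant of the matrix.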

How it works...

The same as with vectors, Breeze matrices are initialized by way of the apply method or one of the various other methods in the matrix's companion object. Various other operations are provided by way of polymorphic functions available in the breeze.numerics, breeze.linalg, and breeze.stats packages.