#### Overview of this book

This book will introduce you to the most popular Scala tools, libraries, and frameworks through practical recipes around loading, manipulating, and preparing your data. It will also help you explore and make sense of your data using stunning and insightful visualizations and machine learning toolkits. Starting with introductory recipes on utilizing the Breeze and Spark libraries, get to grips with how to import data from a host of possible sources and how to pre-process numerical, string, and date data. Next, you’ll get an understanding of concepts that will help you visualize data using the Apache Zeppelin and Bokeh bindings in Scala, enabling exploratory data analysis. Discover how to program quintessential machine learning algorithms using the Spark ML library. Work through steps to scale your machine learning models and deploy them into a standalone cluster, EC2, YARN, and Mesos. Finally, dip into the powerful options presented by Spark Streaming, and machine learning for streaming data, as well as utilizing Spark GraphX.
Scala Data Analysis Cookbook

## Working with matrices

As we discussed in the "Working with vectors" recipe, you could use the Eclipse or IntelliJ IDEA Scala worksheets for a faster turnaround time.

### How to do it...

Breeze matrices support a variety of operations. In this recipe, we will look at some details around:

• Creating matrices:
  • Creating a matrix from values
  • Creating a zero matrix
  • Creating a matrix out of a function
  • Creating an identity matrix
  • Creating a matrix from random numbers
  • Creating from a Scala collection

• Matrix arithmetic:
  • Addition
  • Multiplication (also element-wise)

• Appending and conversion:
  • Concatenating a matrix vertically
  • Concatenating a matrix horizontally
  • Converting a matrix of Int to a matrix of Double

• Data manipulation operations:
  • Getting column vectors
  • Getting row vectors
  • Getting values inside the matrix
  • Getting the inverse and transpose of a matrix

• Computing basic statistics:
  • Mean and variance
  • Standard deviation
  • Finding the largest value
  • Finding the sum, square root and log of all the values in the matrix
  • Calculating the eigenvectors and eigenvalues of a matrix

#### Creating matrices

Let's first see how to create a matrix.

##### Creating a matrix from values

The simplest way to create a matrix is to pass in the values in a row-wise fashion into the `apply` function of the matrix object:

```scala
import breeze.linalg._

val simpleMatrix = DenseMatrix((1, 2, 3), (11, 12, 13), (21, 22, 23))
// Returns a DenseMatrix[Int]
// 1   2   3
// 11  12  13
// 21  22  23
```

There's also a sparse version of the matrix: the Compressed Sparse Column matrix (`CSCMatrix`):

```scala
val sparseMatrix = CSCMatrix((1, 0, 0), (11, 0, 0), (0, 0, 23))
// Returns a CSCMatrix[Int]
// (0,0) 1
// (1,0) 11
// (2,2) 23
```

### Note

Breeze's `CSCMatrix` stores only the non-zero values, in compressed sparse column form. When printed, each stored value is listed against its (row, column) position.
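Passing dense rows to `apply` is convenient for small examples, but a large sparse matrix is usually assembled entry by entry. The following is a minimal sketch that uses `CSCMatrix.Builder` for this; the names `builder`, `builtSparse`, and `storedEntries` are ours, chosen for illustration:

```scala
import breeze.linalg.CSCMatrix

// Build the same 3 x 3 sparse matrix by adding only the non-zero entries.
val builder = new CSCMatrix.Builder[Int](rows = 3, cols = 3)
builder.add(0, 0, 1)
builder.add(1, 0, 11)
builder.add(2, 2, 23)
val builtSparse: CSCMatrix[Int] = builder.result()

// Number of entries that are actually stored.
val storedEntries = builtSparse.activeSize
```

Positions that were never added read back as zero, so `builtSparse(0, 1)` returns `0` even though nothing is stored for it.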

##### Creating a zero matrix

Creating a zero matrix is just a matter of calling the matrix's `zeros` function. The first integer parameter indicates the rows and the second parameter indicates the columns:

```scala
val denseZeros = DenseMatrix.zeros[Double](5, 4)
// Returns a DenseMatrix[Double]
// 0.0  0.0  0.0  0.0
// 0.0  0.0  0.0  0.0
// 0.0  0.0  0.0  0.0
// 0.0  0.0  0.0  0.0
// 0.0  0.0  0.0  0.0

val compressedSparseMatrix = CSCMatrix.zeros[Double](5, 4)
// Returns a CSCMatrix[Double] = 5 x 4 CSCMatrix
```

### Note

Notice how the `CSCMatrix` doesn't allocate any memory for the values in the zero-valued matrix.

##### Creating a matrix out of a function

The `tabulate` function for a matrix is very similar to the vector's version. It accepts the row and column sizes (in the example, `5` and `4`) in its first parameter list and, in its second, a function that populates the values of the matrix from the row and column indices. In our example, we generate each value by multiplying the row index by the column index:

```scala
val denseTabulate = DenseMatrix.tabulate[Double](5, 4)((firstIdx, secondIdx) => firstIdx * secondIdx)
// Returns a DenseMatrix[Double] =
// 0.0  0.0  0.0  0.0
// 0.0  1.0  2.0  3.0
// 0.0  2.0  4.0  6.0
// 0.0  3.0  6.0  9.0
// 0.0  4.0  8.0  12.0
```

The type parameter is needed only if you would like the result to be a `Double` matrix rather than an `Int` one. The following call without the type parameter returns an `Int` matrix:

```scala
val denseTabulate = DenseMatrix.tabulate(5, 4)((firstIdx, secondIdx) => firstIdx * secondIdx)
// Returns a DenseMatrix[Int] =
// 0  0  0  0
// 0  1  2  3
// 0  2  4  6
// 0  3  6  9
// 0  4  8  12
```
##### Creating an identity matrix

The `eye` function of the matrix generates a square identity matrix with the given dimension (in the example's case, `3`):

```scala
val identityMatrix = DenseMatrix.eye[Int](3)
// Returns a DenseMatrix[Int]
// 1  0  0
// 0  1  0
// 0  0  1
```
##### Creating a matrix from random numbers

The `rand` function in the matrix generates a matrix of a given dimension (4 rows by 4 columns in our case) with random values between `0` and `1`. We'll have an in-depth look into random number generated vectors and matrices in a subsequent recipe.

```scala
val randomMatrix = DenseMatrix.rand(4, 4)
// Returns a DenseMatrix[Double]; the values differ on every run, for example:
// 0.09762565779429777   0.01089176285376725  0.2660579009292807   0.19428193961985674
// 0.9662568115400412    0.718377391997945    0.8230367668470933   0.3957540854393169
// 0.9080090988364429    0.7697780247035393   0.49887760321635066  0.26722019105654415
// 3.326843165250004E-4  0.447925644082819    0.8195838733418965   0.7682752255172411
```
##### Creating from a Scala collection

We could create a matrix out of a Scala array too. The constructor of the matrix accepts three arguments: the rows, the columns, and an array with values for the dimensions. Note that the data from the array is picked up to construct the matrix in column-major order, that is, the first column is filled first, then the second, and so on:

```scala
val vectFromArray = new DenseMatrix(2, 2, Array(2, 3, 4, 5))
// Returns a DenseMatrix[Int]
// 2  4
// 3  5
```
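To make the column-major layout concrete, here is a small sketch that reads individual elements back and relates them to positions in the source array. The names `values`, `colMajor`, and `rowMajor` are ours; the index formula in the comments describes the default (non-transposed) layout:

```scala
import breeze.linalg.DenseMatrix

val values = Array(2, 3, 4, 5)
val colMajor = new DenseMatrix(2, 2, values)

// Element (row, col) is read from values(col * rows + row).
val secondRowFirstCol = colMajor(1, 0) // values(0 * 2 + 1)
val firstRowSecondCol = colMajor(0, 1) // values(1 * 2 + 0)

// Transposing gives the matrix you would get by reading the array row by row.
val rowMajor = colMajor.t
```

If your data arrives row by row, building the matrix with the row and column counts swapped and then transposing is one simple way to get the layout you expect.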

If there are more values than the number of values required by the dimensions of the matrix, the rest of the values are ignored. Note how `(6,7)` is ignored in the array:

```scala
val vectFromArray = new DenseMatrix(2, 2, Array(2, 3, 4, 5, 6, 7))
// DenseMatrix[Int]
// 2  4
// 3  5
```

However, if fewer values are present in the array than what is required by the dimensions of the matrix, then the constructor call would throw an `ArrayIndexOutOfBoundsException`:

```scala
val vectFromArrayIobe = new DenseMatrix(2, 2, Array(2, 3, 4))
// throws java.lang.ArrayIndexOutOfBoundsException: 3
```

#### Matrix arithmetic

Now let's look at the basic arithmetic that we could do using matrices.

Let's consider a simple 3 × 3 `simpleMatrix` and a corresponding identity matrix:

```scala
val simpleMatrix = DenseMatrix((1, 2, 3), (11, 12, 13), (21, 22, 23))
// DenseMatrix[Int]
// 1   2   3
// 11  12  13
// 21  22  23

val identityMatrix = DenseMatrix.eye[Int](3)
// DenseMatrix[Int]
// 1  0  0
// 0  1  0
// 0  0  1
```

Adding two matrices will result in a matrix whose corresponding elements are summed up.

```scala
val additionMatrix = identityMatrix + simpleMatrix
// Returns DenseMatrix[Int]
// 2   2   3
// 11  13  13
// 21  22  24
```
##### Multiplication

Now, as you would expect, multiplying a matrix with its identity should give you the matrix itself:

```scala
val simpleTimesIdentity = simpleMatrix * identityMatrix
// Returns DenseMatrix[Int]
// 1   2   3
// 11  12  13
// 21  22  23
```

Breeze also provides element-by-element operators, written by prefixing the operator with a colon, for example, `:+`, `:-`, `:*`, and so on. Check out what happens when we do an element-wise multiplication of the identity matrix and the simple matrix. Only the diagonal survives, because every off-diagonal element is multiplied by `0`:

```scala
val elementWiseMulti = identityMatrix :* simpleMatrix
// DenseMatrix[Int]
// 1  0   0
// 0  12  0
// 0  0   23
```
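Because the identity matrix hides most of the difference, it may help to compare the two kinds of multiplication on a pair of ordinary matrices. The sketch below uses two hypothetical 2 × 2 matrices, `a` and `b`. It assumes Breeze 0.13 or later, where the element-wise operator is spelled `*:*`; on older releases, use `:*` as in the example above:

```scala
import breeze.linalg.DenseMatrix

val a = DenseMatrix((1, 2), (3, 4))
val b = DenseMatrix((5, 6), (7, 8))

// Matrix product: each row of a is combined with each column of b.
val matrixProduct = a * b
// 19  22
// 43  50

// Element-wise product: a(i, j) * b(i, j).
// Assumption: Breeze 0.13+ operator spelling (older releases use :*).
val elementWiseProduct = a *:* b
// 5   12
// 21  32
```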

#### Appending and conversion

Let's briefly see how to append two matrices and convert matrices of one numeric type to another.

##### Concatenating matrices – vertically

Similar to vectors, `DenseMatrix` has a `vertcat` function, which vertically concatenates an arbitrary number of matrices. The row size of the resulting matrix is the sum of the row sizes of all the matrices combined:

```scala
val vertConcatMatrix = DenseMatrix.vertcat(identityMatrix, simpleMatrix)
// DenseMatrix[Int]
// 1   0   0
// 0   1   0
// 0   0   1
// 1   2   3
// 11  12  13
// 21  22  23
```

Attempting to concatenate matrices with different numbers of columns would, as expected, throw an `IllegalArgumentException`:

`java.lang.IllegalArgumentException: requirement failed: Not all matrices have the same number of columns`
##### Concatenating matrices – horizontally

Not surprisingly, the `horzcat` function concatenates the matrix horizontally—the column size of the matrix increases to the sum of the column sizes of all the matrices:

```scala
val horzConcatMatrix = DenseMatrix.horzcat(identityMatrix, simpleMatrix)
// DenseMatrix[Int]
// 1  0  0  1   2   3
// 0  1  0  11  12  13
// 0  0  1  21  22  23
```

Similar to the vertical concatenation, attempting to concatenate a matrix of a different row size would throw an `IllegalArgumentException`:

`java.lang.IllegalArgumentException: requirement failed: Not all matrices have the same number of rows`
##### Converting a matrix of Int to a matrix of Double

The conversion of one type of matrix to another is not automatic in Breeze. However, there is a simple way to achieve this:

```scala
import breeze.linalg.convert

val simpleMatrixAsDouble = convert(simpleMatrix, Double)
// DenseMatrix[Double] =
// 1.0   2.0   3.0
// 11.0  12.0  13.0
// 21.0  22.0  23.0
```
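If you prefer to spell the conversion out, mapping over the values is one alternative. This is only a sketch: `intMatrix`, `viaConvert`, and `viaMapValues` are names we made up, and we are assuming that `mapValues` is available on `DenseMatrix` in your Breeze version:

```scala
import breeze.linalg.{DenseMatrix, convert}

val intMatrix = DenseMatrix((1, 2, 3), (11, 12, 13), (21, 22, 23))

// The conversion used in this recipe.
val viaConvert: DenseMatrix[Double] = convert(intMatrix, Double)

// The same result, obtained by converting each element explicitly.
val viaMapValues: DenseMatrix[Double] = intMatrix.mapValues(_.toDouble)
```

The explicit mapping is handy when the conversion is not a plain widening, for example when you want to scale or round each value along the way.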

#### Data manipulation operations

Let's create a simple 2 × 2 matrix that will be used for the rest of this section:

```scala
val simpleMatrix = DenseMatrix((4.0, 7.0), (3.0, -5.0))
// DenseMatrix[Double] =
// 4.0  7.0
// 3.0  -5.0
```
##### Getting column vectors out of the matrix

The first column vector could be retrieved by passing in the column parameter as `0` and using `::` in order to say that we are interested in all the rows.

```scala
val firstVector = simpleMatrix(::, 0)
// DenseVector(4.0, 3.0)
```

Getting the second column vector and so on is achieved by passing the correct zero-indexed column number:

```scala
val secondVector = simpleMatrix(::, 1)
// DenseVector(7.0, -5.0)
```

Alternatively, you could explicitly pass in the range of rows to be extracted:

```scala
val firstVectorByRows = simpleMatrix(0 to 1, 0)
// DenseVector(4.0, 3.0)
```

While explicitly stating the range (as in `0 to 1`), we have to be careful not to exceed the matrix size. For example, the following attempt to select 3 columns (`0` through `2`) of the first row of a 2 × 2 matrix would throw an `ArrayIndexOutOfBoundsException`:

```scala
val errorTryingToSelect3ColumnsOn2By2Matrix = simpleMatrix(0, 0 to 2)
// java.lang.ArrayIndexOutOfBoundsException
```
##### Getting row vectors out of the matrix

If we would like to get the row vector, all we need to do is play with the row and column parameters again. As expected, it would give a transpose of the column vector, which is simply a row vector.

Like the column vector, we could either explicitly state our columns or pass in a wildcard (`::`) to cover the entire range of columns:

```scala
val firstRowStatingCols = simpleMatrix(0, 0 to 1)
// Transpose(DenseVector(4.0, 7.0))

val firstRowAllCols = simpleMatrix(0, ::)
// Transpose(DenseVector(4.0, 7.0))
```

Getting the second row vector is achieved by passing the second row (1) and all the columns (`::`) in that vector:

```scala
val secondRow = simpleMatrix(1, ::)
// Transpose(DenseVector(3.0, -5.0))
```
##### Getting values inside the matrix

If we are just interested in a single value within the matrix, we pass in its exact row and column number. For example, to get the value at the first row and first column:

```scala
val firstRowFirstCol = simpleMatrix(0, 0)
// Double = 4.0
```
##### Getting the inverse and transpose of a matrix

Getting the inverse and the transpose of a matrix is a little counter-intuitive in Breeze. Let's consider the same matrix that we dealt with earlier:

`val simpleMatrix=DenseMatrix((4.0,7.0),(3.0,-5.0))`

On the one hand, the transpose is available as the `t` method on the matrix itself, like so:

```scala
val transpose = simpleMatrix.t
// 4.0  3.0
// 7.0  -5.0
```

The inverse, on the other hand, is computed by `inv`, a universal function under the `breeze.linalg` package:

```scala
val inverse = inv(simpleMatrix)
// 0.12195121951219512  0.17073170731707318
// 0.07317073170731708  -0.0975609756097561
```

Let's multiply the matrix by its inverse and confirm that the result is an identity matrix:

```scala
simpleMatrix * inverse
// 1.0                     0.0
// -5.551115123125783E-17  1.0
```

As expected, the result is an identity matrix, up to the rounding error of floating point arithmetic.
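Since the product is only approximately an identity matrix, comparing it with `==` would fail. A minimal sketch of a tolerance-based check follows; the names `m`, `product`, `maxError`, and `isIdentity` are ours, and the tolerance of `1e-12` is an arbitrary choice for this example:

```scala
import breeze.linalg.{DenseMatrix, inv, max}
import breeze.numerics.abs

val m = DenseMatrix((4.0, 7.0), (3.0, -5.0))
val product = m * inv(m)

// Largest absolute deviation of any element from the 2 x 2 identity matrix.
val maxError = max(abs(product - DenseMatrix.eye[Double](2)))

// Treat the product as an identity matrix if every element is close enough.
val isIdentity = maxError < 1e-12
```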

#### Computing basic statistics

Now, just like vectors, let's briefly look at how to calculate some basic summary statistics for a matrix.

### Tip

This needs imports of `breeze.linalg._`, `breeze.numerics._`, and `breeze.stats._`. The operations in the "Other operations" section aim to simulate NumPy's `UFunc`, or universal functions.

##### Mean and variance

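The mean and variance of a matrix can be calculated by calling the `meanAndVariance` universal function in the `breeze.stats` package. Note that this needs a matrix of `Double`, so we use `simpleMatrixAsDouble`, the 3 × 3 matrix that we converted earlier. The result holds the mean, the variance, and the number of elements:

```scala
meanAndVariance(simpleMatrixAsDouble)
// MeanAndVariance(12.0,75.75,9)
```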

Alternatively, converting an `Int` matrix to a `Double` matrix and calculating the mean and variance for that Matrix could be merged into a one-liner:

`meanAndVariance(convert(simpleMatrix, Double))`
##### Standard deviation

Calling `stddev` on a `Double` matrix gives the standard deviation:

```scala
stddev(simpleMatrixAsDouble)
// Double = 8.703447592764606
```
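To see where these numbers come from, we could recompute them by hand. The following sketch rebuilds the same 3 × 3 matrix under the hypothetical name `m` and divides the sum of squared deviations by `n - 1`, which is the sample variance and is what reproduces the `75.75` reported above:

```scala
import breeze.linalg.{DenseMatrix, sum}
import breeze.stats.meanAndVariance

val m = DenseMatrix((1.0, 2.0, 3.0), (11.0, 12.0, 13.0), (21.0, 22.0, 23.0))

val n = m.size                                        // 9 elements
val manualMean = sum(m) / n                           // 108.0 / 9
val squaredDeviations = m.toArray.map(x => (x - manualMean) * (x - manualMean))
val manualVariance = squaredDeviations.sum / (n - 1)  // sample variance: divide by n - 1
val manualStddev = math.sqrt(manualVariance)

// Library result, for comparison with the manual values.
val fromBreeze = meanAndVariance(m)
```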

Next up, let's look at some basic aggregation operations:

`val simpleMatrix=DenseMatrix((1,2,3),(11,12,13),(21,22,23))`
##### Finding the largest value in a matrix

The (`apply` method of the) `max` object (a universal function) inside the `breeze.linalg` package will help us do that:

```scala
val intMaxOfMatrixVals = max(simpleMatrix)
// 23
```
##### Finding the sum, square root and log of all the values in the matrix

The same as with `max`, the `sum` object inside the `breeze.linalg` package calculates the sum of all the matrix elements:

```scala
val intSumOfMatrixVals = sum(simpleMatrix)
// 108
```
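Often we want a sum per row or per column rather than one grand total. The sketch below uses Breeze's broadcasting syntax (`*` together with `::`) for this. The names `m`, `total`, `columnSums`, and `rowSums` are ours, and the exact return type of the column-wise call may vary between Breeze versions, so treat it as an assumption:

```scala
import breeze.linalg._

val m = DenseMatrix((1, 2, 3), (11, 12, 13), (21, 22, 23))

// Sum of all the elements.
val total = sum(m)

// One sum per column (33, 36, 39).
val columnSums = sum(m(::, *))

// One sum per row (6, 36, 66), returned as a DenseVector.
val rowSums = sum(m(*, ::))
```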

The functions `sqrt`, `log`, and various other objects (universal functions) in the `breeze.numerics` package calculate the square root and log values of all the individual values inside the matrix.

##### Sqrt
```scala
val sqrtOfMatrixVals = sqrt(simpleMatrix)
// DenseMatrix[Double] =
// 1.0               1.4142135623730951  1.7320508075688772
// 3.3166247903554   3.4641016151377544  3.605551275463989
// 4.58257569495584  4.69041575982343    4.795831523312719
```
##### Log
```scala
// log computes the natural logarithm of each element
val logMatrixVals = log(simpleMatrix)
// DenseMatrix[Double]
// 0.0                 0.6931471805599453  1.0986122886681098
// 2.3978952727983707  2.4849066497880004  2.5649493574615367
// 3.044522437723423   3.091042453358316   3.1354942159291497
```
##### Calculating the eigenvectors and eigenvalues of a matrix

Calculating eigenvectors is straightforward in Breeze. Let's consider our `simpleMatrix` from the previous section:

`val simpleMatrix=DenseMatrix((4.0,7.0),(3.0,-5.0))`

Calling the `breeze.linalg.eig` universal function on a matrix returns a `breeze.linalg.eig.DenseEig` object that encapsulates the eigenvectors and eigenvalues:

`val denseEig=eig(simpleMatrix)`

This line of code returns the following:

```
Eig(
DenseVector(5.922616289332565, -6.922616289332565),
DenseVector(0.0, 0.0),
0.9642892971721949   -0.5395744865143975
0.26485118719604456  0.8419378679586305)
```

We could extract the eigenvectors and eigenvalues by calling the corresponding functions on the returned `Eig` reference:

```scala
val eigenVectors = denseEig.eigenvectors
// DenseMatrix[Double] =
// 0.9642892971721949   -0.5395744865143975
// 0.26485118719604456  0.8419378679586305
```

The two eigenvalues corresponding to the two eigenvectors (the columns of the matrix above) could be captured using the `eigenvalues` function on the `Eig` object:

```scala
val eigenValues = denseEig.eigenvalues
// DenseVector[Double] = DenseVector(5.922616289332565, -6.922616289332565)
```

Let's validate the eigenvalues and the vectors:

1. Let's multiply the matrix with the first eigenvector:

```scala
val matrixToEigVector = simpleMatrix * denseEig.eigenvectors(::, 0)
// DenseVector(5.7111154990610915, 1.568611955536362)
```

2. Then let's multiply the first eigenvalue with the first eigenvector. The resulting vector will be the same with a marginal error when doing floating point arithmetic:

```scala
val vectorToEigValue = denseEig.eigenvectors(::, 0) * denseEig.eigenvalues(0)
// DenseVector(5.7111154990610915, 1.5686119555363618)
```
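The two steps above can be generalized into a check over every eigenpair. This is a minimal sketch rather than a definitive test: the names `m`, `decomposition`, `residuals`, and `allPairsValid` are ours, the tolerance of `1e-9` is arbitrary, and it only makes sense when the eigenvalues are real, as they are for this matrix:

```scala
import breeze.linalg.{DenseMatrix, eig, norm}

val m = DenseMatrix((4.0, 7.0), (3.0, -5.0))
val decomposition = eig(m)

// For every eigenpair, m * v and lambda * v should agree up to rounding error.
val residuals = (0 until m.cols).map { i =>
  val v = decomposition.eigenvectors(::, i)   // i-th eigenvector (a column)
  val lambda = decomposition.eigenvalues(i)   // matching eigenvalue
  norm(m * v - v * lambda)                    // length of the difference vector
}

val allPairsValid = residuals.forall(_ < 1e-9)
```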

### How it works...

As with vectors, Breeze matrices are initialized by way of the `apply` method or one of the various other methods in the matrix's companion object. Various other operations are provided by way of polymorphic functions available in the `breeze.numerics`, `breeze.linalg`, and `breeze.stats` packages.