Scala for Machine Learning

By: Patrick R. Nicolas

Profiling data

The choice of a preprocessing, clustering, or classification algorithm depends heavily on the quality and profile of the input data (observations and, whenever available, expected values). The "Step 3 – preprocessing the data" section under "Let's kick the tires" in Chapter 1, "Getting Started", introduced the MinMax class, which normalizes a dataset using its minimum and maximum values.
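As a reminder of what normalization by minimum and maximum involves, here is a minimal sketch. It is not the MinMax class from Chapter 1: the object name MinMaxSketch and the method normalize are hypothetical, and the target range [0, 1] is an assumption made for illustration.

```scala
// Hypothetical helper, not the book's MinMax class.
// Rescales every value linearly so that the minimum maps to 0.0 and the maximum to 1.0.
object MinMaxSketch {
  def normalize(values: Vector[Double]): Vector[Double] = {
    require(values.nonEmpty, "Cannot normalize an empty dataset")
    val lo = values.min
    val hi = values.max
    val range = hi - lo
    // A constant dataset has a zero range; map it to 0.0 to avoid dividing by zero.
    if (range == 0.0) values.map(_ => 0.0)
    else values.map(x => (x - lo) / range)
  }
}
```

For example, MinMaxSketch.normalize(Vector(2.0, 4.0, 6.0)) returns Vector(0.0, 0.5, 1.0).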

Immutable statistics

The mean and standard deviation are the most commonly used statistics.


Mean and variance

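For a dataset of n observations x_1, ..., x_n, the arithmetic mean is defined as:

\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Variance is defined as:

\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \]

Variance adjusted for a sampling bias is defined as:

\[ \sigma'^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2 \]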
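As a quick numerical check of these three statistics, the following sketch assumes the usual definitions: the mean is the sum divided by n, the variance is the average squared deviation from the mean, and the adjusted variance divides by n - 1 instead of n. The object name MomentsSketch and its methods are hypothetical and unrelated to the book's library.

```scala
// Hypothetical illustration of the standard definitions, not the book's Stats class.
object MomentsSketch {
  // Arithmetic mean: sum of the observations divided by their count.
  def mean(xs: Vector[Double]): Double = {
    require(xs.nonEmpty, "The mean of an empty dataset is undefined")
    xs.sum / xs.size
  }

  // Sum of squared deviations from the mean, shared by both variances.
  private def squaredDeviations(xs: Vector[Double]): Double = {
    val mu = mean(xs)
    xs.map(x => (x - mu) * (x - mu)).sum
  }

  // Variance: average squared deviation (divides by n).
  def variance(xs: Vector[Double]): Double =
    squaredDeviations(xs) / xs.size

  // Variance adjusted for sampling bias (divides by n - 1).
  def adjustedVariance(xs: Vector[Double]): Double = {
    require(xs.size > 1, "The adjusted variance needs at least two observations")
    squaredDeviations(xs) / (xs.size - 1)
  }
}
```

For the sample Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0), the mean is 5.0, the variance is 4.0, and the adjusted variance is 32/7, or about 4.571. The sum of squared deviations also equals the sum of the squares minus n times the squared mean, so both variances can be derived from two running sums computed in a single pass over the data.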

Let's extend the MinMax class with some basic statistics capabilities using Stats:

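class Stats[T <: AnyVal](
     values: Vector[T])(implicit f: T => Double)
  extends MinMax[T](values) {

  // Accumulator for the pair (sum of values, sum of squared values)
  val zero = (0.0, 0.0)
  // Single pass over the data; foldLeft is the named form of the /: operator
  val sums = values.foldLeft(zero)((s, x) => (s._1 + x, s._2 + x*x)) //1
  lazy val mean = sums._1/values.size  //2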
  lazy val variance = 
         (sums._2 - mean*mean*values...