Scala for Machine Learning

By: Patrick R. Nicolas

Profiling data


The selection of a preprocessing, clustering, or classification algorithm depends heavily on the quality and profile of the input data (observations and, whenever available, expected values). The Step 3 – preprocessing the data section under Let's kick the tires in Chapter 1, Getting Started, introduced the MinMax class for normalizing a dataset using its minimum and maximum values.
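As a reminder of what normalization by minimum and maximum involves, here is a minimal sketch. It is not the MinMax class from Chapter 1: the object MinMaxSketch, the normalize function, and its low and high parameters are hypothetical names chosen for illustration, and the handling of a constant dataset is an assumption.

```scala
object MinMaxSketch {
  // Hypothetical helper, not the book's MinMax class.
  // Rescales values linearly so that the minimum maps to `low`
  // and the maximum maps to `high`.
  def normalize(
      values: Vector[Double],
      low: Double = 0.0,
      high: Double = 1.0): Vector[Double] = {
    require(values.nonEmpty, "Cannot normalize an empty dataset")
    require(low < high, "low must be smaller than high")

    val min = values.min
    val max = values.max
    val range = max - min

    // Assumption: a constant dataset has no spread, so every value maps to `low`.
    if (range == 0.0) values.map(_ => low)
    else values.map(x => low + (x - min) * (high - low) / range)
  }
}
```

For example, MinMaxSketch.normalize(Vector(2.0, 4.0, 6.0)) rescales the three values to 0.0, 0.5, and 1.0.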

Immutable statistics

The mean and standard deviation are the most commonly used statistics.

Note

Mean and variance

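Arithmetic mean of n observations x_0, ..., x_{n-1} is defined as:

\[ \mu = \frac{1}{n} \sum_{i=0}^{n-1} x_i \]

Variance is defined as:

\[ \sigma^2 = \frac{1}{n} \sum_{i=0}^{n-1} (x_i - \mu)^2 \]

Variance adjusted for a sampling bias is defined as:

\[ \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=0}^{n-1} (x_i - \mu)^2 \]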
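The three statistics in this note can be sketched directly in Scala. This is a standalone illustration rather than the book's implementation: the object StatsSketch and its method names are hypothetical, and the sketch computes the mean first and then the squared deviations (two passes over the data) instead of accumulating sums in a single fold.

```scala
object StatsSketch {
  // Arithmetic mean: sum of the observations divided by their count.
  def mean(xs: Vector[Double]): Double = {
    require(xs.nonEmpty, "Mean is undefined for an empty dataset")
    xs.sum / xs.size
  }

  // Sum of squared deviations from the mean, shared by both variances.
  private def squaredDeviations(xs: Vector[Double]): Double = {
    val mu = mean(xs)
    xs.map(x => (x - mu) * (x - mu)).sum
  }

  // Variance: squared deviations averaged over n.
  def variance(xs: Vector[Double]): Double =
    squaredDeviations(xs) / xs.size

  // Variance adjusted for sampling bias: divides by n - 1 instead of n.
  def sampleVariance(xs: Vector[Double]): Double = {
    require(xs.size > 1, "At least two observations are required")
    squaredDeviations(xs) / (xs.size - 1)
  }
}
```

For the dataset Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0), the mean is 5.0 and the squared deviations sum to 32.0, so the variance is 4.0 and the adjusted variance is 32.0/7, roughly 4.57. The adjustment matters most for small datasets, where dividing by n underestimates the spread of the underlying population.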

Let's extend the MinMax class with some basic statistics capabilities using Stats:

class Stats[T <: AnyVal](
     values: Vector[T])(implicit f: T => Double)
  extends MinMax[T](values) {

  val zero = (0.0, 0.0)
  val sums = values./:(zero)((s,x) =>(s._1 +x, s._2 + x*x)) //1
  
  lazy val mean = sums._1/values.size  //2
  lazy val variance = 
         (sums._2 - mean*mean*values...