The selection of a preprocessing, clustering, or classification algorithm depends heavily on the quality and profile of the input data (observations and expected values whenever available). The Step 3 – preprocessing the data section under Let's kick the tires in Chapter 1, Getting Started, introduced the MinMax class for normalizing a dataset using the minimum and maximum values.
The mean and standard deviation are the most commonly used statistics.
Note
Mean and variance
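Arithmetic mean is defined as:
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Variance is defined as:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$$
Variance adjusted for a sampling bias is defined as:
$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$$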
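The three definitions in this note can be sketched directly in Scala. This is a minimal illustration and not the library's implementation: the BasicStats object and its method names are hypothetical, and the input is assumed to be a non-empty sequence of Double values.

```scala
// Hypothetical helper for illustration only; not part of the book's library.
object BasicStats {
  /** Arithmetic mean of a non-empty sequence. */
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  /** Variance: the average squared deviation from the mean (divides by n). */
  def variance(xs: Seq[Double]): Double = {
    val mu = mean(xs)
    xs.map(x => (x - mu) * (x - mu)).sum / xs.size
  }

  /** Variance adjusted for sampling bias (divides by n - 1). */
  def sampleVariance(xs: Seq[Double]): Double = {
    require(xs.size > 1, "sample variance needs at least two observations")
    val mu = mean(xs)
    xs.map(x => (x - mu) * (x - mu)).sum / (xs.size - 1)
  }
}
```

For the observations 2, 4, 4, 4, 5, 5, 7, 9, the mean is 5 and the squared deviations sum to 32, so the variance is 32/8 = 4 and the bias-adjusted variance is 32/7. The two-pass form used here trades an extra traversal for readability and numerical robustness.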
Let's extend the MinMax class with some basic statistics capabilities using Stats:
class Stats[T <: AnyVal](values: Vector[T])(implicit f: T => Double)
    extends MinMax[T](values) {
  val zero = (0.0, 0.0)
  val sums = values./:(zero)((s, x) => (s._1 + x, s._2 + x*x)) //1
  lazy val mean = sums._1/values.size //2
  lazy val variance = (sums._2 - mean*mean*values...