Spark Cookbook

By Rishi Yadav

Calculating summary statistics


Summary statistics are used to summarize observations and get a collective sense of the data. The summary includes the following:

  • Central tendency of data—mean, mode, median

  • Spread of data—variance, standard deviation

  • Boundary conditions—min, max

This recipe covers how to produce summary statistics.
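Before running the recipe, it may help to see what these quantities are. The following is a minimal plain-Scala sketch (not part of the original recipe) that computes the mean and variance of one column by hand, using the first column of the sample data below and assuming the unbiased sample-variance convention with an n - 1 denominator that MLlib's column summary follows:

    scala> // Reference only: mean and sample variance of one column computed by hand
    scala> val firstColumn = List(150.0, 300.0)
    scala> val mean = firstColumn.sum / firstColumn.size
    scala> val sampleVariance = firstColumn.map(x => math.pow(x - mean, 2)).sum / (firstColumn.size - 1)

For these two values the mean is 225.0 and the sample variance is 11250.0, which should match the first entries of the mean and variance printed in the steps below.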

How to do it…

  1. Start the Spark shell:

    $ spark-shell
    
  2. Import the vector and statistics-related classes:

    scala> import org.apache.spark.mllib.linalg.{Vectors,Vector}
    scala> import org.apache.spark.mllib.stat.Statistics
    
  3. Create personRDD as an RDD of vectors:

    scala> val personRDD = sc.parallelize(List(Vectors.dense(150,60,25), Vectors.dense(300,80,40)))
    
  4. Compute the column summary statistics:

    scala> val summary = Statistics.colStats(personRDD)
    
  5. Print the mean of this summary:

    scala> print(summary.mean)
    
  6. Print the variance:

    scala> print(summary.variance)
    
  7. Print the number of non-zero values in each column:

    scala> print(summary.numNonzeros)
    
  8. Print the sample size:

    scala> print(summary.count)
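
The summary returned by colStats is an MLlib MultivariateStatisticalSummary, which also exposes the boundary conditions listed at the start of this recipe through its max and min accessors. As a quick check, the values expected for the two sample vectors are shown as comments (the exact printed format may vary):

    scala> print(summary.max)  // expected roughly [300.0,80.0,40.0]
    scala> print(summary.min)  // expected roughly [150.0,60.0,25.0]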