Spark Cookbook

By: Rishi Yadav

Calculating correlation


Correlation is a statistical relationship between two variables in which a change in one variable is associated with a change in the other. Correlation analysis measures how strongly the two variables are related; it does not by itself establish that one variable causes the other.

If an increase in one variable leads to an increase in another, it is called a positive correlation. If an increase in one variable leads to a decrease in the other, it is a negative correlation.
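As a rough illustration of how the sign of a correlation coefficient reflects these two cases, here is a minimal sketch in plain Python (not Spark's API). The `pearson` helper and the toy values are assumptions made up for this example; they are not taken from the housing data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of co-deviations and squared deviations; the 1/n factors
    # cancel in the ratio, so they are left out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy values, chosen only to show the sign.
hours = [1, 2, 3, 4, 5]
rising = [10, 20, 30, 40, 50]    # goes up as hours goes up
falling = [50, 40, 30, 20, 10]   # goes down as hours goes up

positive = pearson(hours, rising)    # close to +1
negative = pearson(hours, falling)   # close to -1
```

A coefficient near +1 indicates a strong positive correlation, one near -1 a strong negative correlation, and one near 0 little or no linear relationship.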

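Spark supports two correlation algorithms: Pearson and Spearman. The Pearson algorithm measures the linear relationship between two continuous variables, such as a person's height and weight or house size and house price. The Spearman algorithm works on the ranks of the values rather than the values themselves, so it suits ordinal data and relationships that are monotonic but not linear, for example, a house's condition rating and its price. Neither algorithm is meaningful for an unordered categorical variable such as a zip code.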
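To make the idea of rank-based correlation concrete, the following plain-Python sketch (again, not Spark's API) computes a Spearman coefficient from ranks. It uses the textbook shortcut formula, which is valid only when there are no tied values; the helper names `ranks` and `spearman` and the sample numbers are hypothetical.

```python
def ranks(values):
    """Rank of each value, 1 = smallest. Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation using the no-ties shortcut formula:
    1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical values: cubes grow nonlinearly but keep the same ordering.
xs = [3, 1, 4, 2, 5]
cubes = [x ** 3 for x in xs]

same_order = spearman(xs, cubes)                    # identical ranks
reverse_order = spearman(xs, [-y for y in cubes])   # exactly reversed ranks
```

Because only the ordering matters, a perfectly monotonic relationship gives a coefficient of 1 (or -1 when the order is reversed) even when the relationship is far from a straight line.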

Getting ready

Let's use some real data so that we can calculate correlation more meaningfully. The following are the size and price of houses in the City of Saratoga, California, in early 2014:

House size (sq ft)    Price
2100                  $1,620,000
2300                  $1,690,000
2046                  $1...