Linear regression models the value of a response (or outcome) variable y as a linear function of one or more predictor variables (or features), represented by x.
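With a single predictor, as in the house-price example that follows, the model is usually written as a straight line plus an error term. The symbols $\beta_0$, $\beta_1$, and $\varepsilon$ are conventional notation assumed here, not taken from the text:

```latex
y = \beta_0 + \beta_1 x + \varepsilon
```

Here $\beta_0$ is the intercept, $\beta_1$ is the slope (the change in y per unit change in x), and $\varepsilon$ is the part of y that the line does not explain. Fitting the model typically means choosing the $\beta_0$ and $\beta_1$ that minimize the sum of squared errors over the observed data.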
Let's use some housing data to predict the price of a house based on its size. The following are the sizes and prices of houses in the City of Saratoga, CA, in early 2014:
House size (sq. ft.) | Price |
---|---|
2100 | $ 1,620,000 |
2300 | $ 1,690,000 |
2046 | $ 1,400,000 |
4314 | $ 2,000,000 |
1244 | $ 1,060,000 |
4608 | $ 3,830,000 |
2173 | $ 1,230,000 |
2750 | $ 2,400,000 |
4010 | $ 3,380,000 |
1959 | $ 1,480,000 |
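As a rough sanity check on the table above, the following sketch fits a straight line to the ten rows using the closed-form ordinary least squares formulas in plain Scala. It is not Spark's `LinearRegression`, and its result may differ from what Spark reports depending on settings such as regularization. The names `sizes`, `prices`, `sxy`, `sxx`, and `predictPrice` are hypothetical and chosen only for this illustration.

```scala
// House sizes (sq. ft.) and prices (USD), copied from the table above.
val sizes = Seq(2100.0, 2300.0, 2046.0, 4314.0, 1244.0,
                4608.0, 2173.0, 2750.0, 4010.0, 1959.0)
val prices = Seq(1620000.0, 1690000.0, 1400000.0, 2000000.0, 1060000.0,
                 3830000.0, 1230000.0, 2400000.0, 3380000.0, 1480000.0)

// Sample means of the predictor (size) and the response (price).
val meanSize = sizes.sum / sizes.length
val meanPrice = prices.sum / prices.length

// Ordinary least squares with one predictor:
//   slope     = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
//   intercept = meanY - slope * meanX
val sxy = sizes.zip(prices).map { case (x, y) => (x - meanSize) * (y - meanPrice) }.sum
val sxx = sizes.map(x => (x - meanSize) * (x - meanSize)).sum
val slope = sxy / sxx
val intercept = meanPrice - slope * meanSize

// Hypothetical helper: predicted price (USD) for a house of the given size.
def predictPrice(sizeSqFt: Double): Double = intercept + slope * sizeSqFt

println(f"slope = $slope%.2f USD per sq. ft., intercept = $intercept%.2f USD")
```

The statements can be pasted into any Scala REPL, including the Spark shell, since they use only the standard library. The slope is the estimated price increase per additional square foot, and `predictPrice` applies the fitted line to a new house size.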
Here's a graphical representation of the same data:
- Start the Spark shell:
$ spark-shell
- Import the vector and linear regression classes:
scala> import org.apache.spark.ml.linalg.Vectors
scala> import org.apache.spark.ml.regression.LinearRegression
- Create a DataFrame with the house price as the label:
scala> val points = spark.createDataFrame(Seq(
(1620000,Vectors.dense(2100)),
(1690000,Vectors.dense(2300)),
(1400000,Vectors.dense(2046)),
...