Linear regression is the approach to model the value of a response variable y, based on one or more predictor variables or feature x.
Let's use some housing data to predict the price of a house based on its size. The following are the sizes and prices of houses in the City of Saratoga, CA, in early 2014:
House size (sq ft) |
Price |
---|---|
2100 |
$ 1,620,000 |
2300 |
$ 1,690,000 |
2046 |
$ 1,400,000 |
4314 |
$ 2,000,000 |
1244 |
$ 1,060,000 |
4608 |
$ 3,830,000 |
2173 |
$ 1,230,000 |
2750 |
$ 2,400,000 |
4010 |
$ 3,380,000 |
1959 |
$ 1,480,000 |
$ spark-shell
Import the statistics and related classes:
scala> import org.apache.spark.mllib.linalg.Vectors scala> import org.apache.spark.mllib.regression.LabeledPoint scala> import org.apache.spark.mllib.regression.LinearRegressionWithSGD
Create the
LabeledPoint
array with the house price as the label:scala> val points = Array( LabeledPoint(1620000...