Predicting hours of work for census respondents
In this recipe, we will build a simple linear regression model to predict the number of hours each census respondent works per week.
Getting ready
To execute this recipe, you need a working Spark environment. You should also have completed the previous recipe, where we created the training and testing datasets for estimating regression models.
No other prerequisites are required.
How to do it...
Training models with MLlib is pretty straightforward. See the following code snippet:
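import pyspark.mllib.regression as reg  # assumed alias; skip if already imported in an earlier recipe
workhours_model_lm = reg.LinearRegressionWithSGD.train(final_data_hours_train)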
How it works...
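As you can see, we call the .train(...) method directly on the LinearRegressionWithSGD class; there is no need to create an object first. The method returns the fitted model, which we store in workhours_model_lm.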
Note
For a very good overview of the different variants of stochastic gradient descent, see http://ruder.io/optimizing-gradient-descent/.
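To make the idea of stochastic gradient descent more concrete, here is a minimal pure-Python sketch that fits a linear model by updating the weights from one randomly chosen point at a time. It is an illustration only and does not reproduce how MLlib implements LinearRegressionWithSGD. The function sgd_linear_regression, its parameters, and the toy data are all hypothetical, and a plain list of (label, features) pairs stands in for the RDD of labeled points.

```python
import random


def sgd_linear_regression(points, iterations=5000, step=0.1, seed=42):
    """Fit linear weights (no intercept) with plain stochastic gradient descent.

    points: list of (label, features) pairs; a hypothetical stand-in
            for an RDD of labeled points, not an MLlib structure.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(points[0][1])

    for _ in range(iterations):
        # Pick a single observation at random (the "stochastic" part).
        label, features = rng.choice(points)

        # Prediction of the current model and its error for this observation.
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - label

        # Gradient of 0.5 * error**2 with respect to each weight is error * x,
        # so we move each weight a small step in the opposite direction.
        weights = [w - step * error * x for w, x in zip(weights, features)]

    return weights


# Toy, noise-free data generated from label = 2 * a + 3 * b.
grid = (0.0, 0.5, 1.0)
points = [(2 * a + 3 * b, [a, b]) for a in grid for b in grid]

weights = sgd_linear_regression(points)
print(weights)  # should be close to [2.0, 3.0]
```

Because the toy data contains no noise, the estimated weights settle very close to the coefficients used to generate the labels. On real data, such as the census hours, the step size and the number of iterations would need tuning.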
The first, and only, required parameter we pass to the method is an RDD of labeled points that we created...