Let's start with a simple problem: predicting house prices in Boston, for which we can use a publicly available dataset. We are given several demographic and geographical attributes, such as the crime rate or the pupil-teacher ratio in the neighborhood. The goal is to predict the median value of a house in a particular area. As usual, we have some training data for which the answer is known.
This is one of the built-in datasets that scikit-learn comes with, so it is very easy to load the data into memory:
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
The boston object contains several attributes; in particular, boston.data contains the input data and boston.target contains the price of houses.
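To make this layout concrete, the following is a minimal sketch that uses NumPy and a synthetic stand-in object instead of the real dataset (load_boston was removed in scikit-learn 1.2, so the call above needs an older release). The name fake_boston and the random values are assumptions made purely for illustration; only the shapes, 506 rows and 13 attribute columns, are meant to mirror the real data.

```python
import numpy as np
from types import SimpleNamespace

# Hypothetical stand-in for the object returned by load_boston().
# The values are random; only the layout mirrors the real dataset.
rng = np.random.default_rng(0)
fake_boston = SimpleNamespace(
    data=rng.random((506, 13)),  # one row per area, one column per attribute
    target=rng.random(506),      # one median house value per area
)

print(fake_boston.data.shape)    # (506, 13)
print(fake_boston.target.shape)  # (506,)

# Row i of data and element i of target describe the same area.
first_inputs = fake_boston.data[0]
first_price = fake_boston.target[0]
```

The point of the sketch is that data is a two-dimensional array of inputs while target is a one-dimensional array of answers, aligned by row index.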
We will start with a simple one-dimensional regression, trying to regress the price on a single attribute, the average number of rooms per dwelling in the neighborhood, which is stored at position 5 (you...