For each person, we are given their age, yearly income, and whether or not they own a house:
Age | Annual income in USD | House ownership status |
23 | 50,000 | Non-owner |
37 | 34,000 | Non-owner |
48 | 40,000 | Owner |
52 | 30,000 | Non-owner |
28 | 95,000 | Owner |
25 | 78,000 | Non-owner |
35 | 130,000 | Owner |
32 | 105,000 | Owner |
20 | 100,000 | Non-owner |
40 | 60,000 | Owner |
50 | 80,000 | ? (Peter) |
House ownership and annual income
The aim is to predict whether Peter, aged 50, with an income of $80,000 per year, owns a house and could be a potential customer for our insurance company.
In this case, we could try to apply the 1-NN algorithm. However, we should be careful about how we measure the distances between the data points, since the income range is much wider than the age range. For example, incomes of USD 115,000 and USD 116,000 differ by 1,000 in raw units, so the two data points would be very far apart, farther than any two people in the table could be separated by age alone. Relative to each other, however, these incomes are not very different. Because we consider both measures (age and yearly income) to be about as important as each other, we scale both to the range from 0 to 1 according to the following formula:

$$\text{ScaledQuantity} = \frac{\text{ActualQuantity} - \text{MinQuantity}}{\text{MaxQuantity} - \text{MinQuantity}}$$
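In our particular case, with ages ranging from 20 to 52 and incomes ranging from USD 30,000 to USD 130,000, this reduces to the following:

$$\text{ScaledAge} = \frac{\text{ActualAge} - \text{MinAge}}{\text{MaxAge} - \text{MinAge}} = \frac{\text{ActualAge} - 20}{32}$$

$$\text{ScaledIncome} = \frac{\text{ActualIncome} - \text{MinIncome}}{\text{MaxIncome} - \text{MinIncome}} = \frac{\text{ActualIncome} - 30{,}000}{100{,}000}$$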
After scaling, we get the following data:
Age | Scaled age | Annual income in USD | Scaled annual income | House ownership status |
23 | 0.09375 | 50,000 | 0.2 | Non-owner |
37 | 0.53125 | 34,000 | 0.04 | Non-owner |
48 | 0.875 | 40,000 | 0.1 | Owner |
52 | 1 | 30,000 | 0 | Non-owner |
28 | 0.25 | 95,000 | 0.65 | Owner |
25 | 0.15625 | 78,000 | 0.48 | Non-owner |
35 | 0.46875 | 130,000 | 1 | Owner |
32 | 0.375 | 105,000 | 0.75 | Owner |
20 | 0 | 100,000 | 0.7 | Non-owner |
40 | 0.625 | 60,000 | 0.3 | Owner |
50 | 0.9375 | 80,000 | 0.5 | ? |
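The scaled columns in the table above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `min_max_scale` is hypothetical, and it assumes Peter's row is included when taking the minimum and maximum (which does not change the result here, since his age and income lie inside the existing ranges).

```python
# Min-max scaling: map each value to (value - min) / (max - min).
# The last entry in each list is Peter.
ages = [23, 37, 48, 52, 28, 25, 35, 32, 20, 40, 50]
incomes = [50_000, 34_000, 40_000, 30_000, 95_000, 78_000,
           130_000, 105_000, 100_000, 60_000, 80_000]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)

# Print the rows in the same column order as the table.
for age, s_age, income, s_income in zip(ages, scaled_ages, incomes, scaled_incomes):
    print(f"{age} | {s_age:g} | {income:,} | {s_income:g}")
```

Because both columns end up in the interval from 0 to 1, a difference in age and a difference in income contribute on a comparable scale to any distance computed afterwards.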
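Now, if we apply the 1-NN algorithm with the Euclidean metric to the scaled data, we find that Peter's nearest neighbor is the 40-year-old owner earning USD 60,000 (at a distance of about 0.37), so Peter more than likely owns a house. Note that, without rescaling, the algorithm would yield a different result: income differences would dominate the distance, and the nearest neighbor would be the 25-year-old non-owner earning USD 78,000. Refer to Exercise 1.5 for more information.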
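The effect of rescaling on the 1-NN prediction can be checked with a short sketch. The function names `nearest_neighbor` and `scale_point` are hypothetical, and the code assumes Python 3.8 or later for `math.dist`; it is an illustration of the idea, not a reference implementation.

```python
import math

# (age, annual income in USD, status) for the ten labeled people.
people = [
    (23, 50_000, "Non-owner"), (37, 34_000, "Non-owner"),
    (48, 40_000, "Owner"), (52, 30_000, "Non-owner"),
    (28, 95_000, "Owner"), (25, 78_000, "Non-owner"),
    (35, 130_000, "Owner"), (32, 105_000, "Owner"),
    (20, 100_000, "Non-owner"), (40, 60_000, "Owner"),
]
peter = (50, 80_000)

labels = [status for _, _, status in people]
raw_points = [(age, income) for age, income, _ in people]

# Ranges used for min-max scaling, taken from the labeled data.
min_age, max_age = min(a for a, _ in raw_points), max(a for a, _ in raw_points)
min_inc, max_inc = min(i for _, i in raw_points), max(i for _, i in raw_points)


def scale_point(point):
    """Map (age, income) to min-max scaled coordinates."""
    age, income = point
    return ((age - min_age) / (max_age - min_age),
            (income - min_inc) / (max_inc - min_inc))


def nearest_neighbor(points, point_labels, query):
    """Return (label, distance) of the point closest to query (Euclidean)."""
    best = min(range(len(points)), key=lambda i: math.dist(points[i], query))
    return point_labels[best], math.dist(points[best], query)


scaled_points = [scale_point(p) for p in raw_points]
scaled_result = nearest_neighbor(scaled_points, labels, scale_point(peter))
raw_result = nearest_neighbor(raw_points, labels, peter)

print("scaled:  ", scaled_result)
print("unscaled:", raw_result)
```

On the scaled data the nearest neighbor is an owner at a distance of roughly 0.37, whereas on the raw data the nearest neighbor is a non-owner roughly 2,000 units away, a distance made up almost entirely of the income difference.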