The k-NN algorithm works based on distance. When a new data point is added to the dataset and the algorithm is asked to classify it, it uses a distance metric to find the points closest to it.
If our features have very different ranges of values – for example, feature one ranges from 0 to 800 while feature two ranges from 1 to 5 – the distance metric no longer makes sense: the wide-range feature dominates the distance. We want all the features to share the same range of values so that the distance metric treats every feature on level terms.
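To see why, here is a minimal sketch with hypothetical points using the ranges above (the values are chosen purely for illustration): two points that differ by a moderate amount in the wide-range feature end up far more distant than two points that differ across the entire narrow-range feature.

```python
import numpy as np

# Hypothetical points: feature one spans 0-800, feature two spans 1-5.
a = np.array([100.0, 1.0])
b = np.array([300.0, 1.0])  # differs only in the wide-range feature
c = np.array([100.0, 5.0])  # differs across the full narrow-range feature

# Euclidean distances before any scaling:
d_ab = np.linalg.norm(a - b)  # 200.0 -- dominated by feature one
d_ac = np.linalg.norm(a - c)  # 4.0   -- feature two barely registers
```

Even though `c` differs from `a` over feature two's entire range, that difference is negligible next to feature one's scale, so k-NN would effectively ignore feature two.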
One way to do this is to subtract the mean of a feature from each of its values and divide by the standard deviation of that feature. This is called standardization:
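Written out, standardization maps each value $x$ of a feature to a z-score:

```latex
z = \frac{x - \mu}{\sigma}
```

where $\mu$ is the feature's mean and $\sigma$ its standard deviation. After this transformation, every feature has mean 0 and standard deviation 1.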
We can do this for our dataset by using the following code:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)  # X is the feature matrix of our dataset