If you have been following the discussion on SVMs closely, you may have noticed a fundamental limitation in the way an SVM operates. We have discussed how the SVM classifier fits a hyperplane to our data such that the margin of separation between the classes is maximized. Doing that involves a very strong assumption: that our data is linearly separable. This is a fancy way of saying that we can separate the classes using geometrical structures such as straight lines or planes (hyperplanes, in general). What would happen if our data is not linearly separable?
For example, try as hard as you might, there is simply no way to fit a straight line that separates the two classes of data in the following image:
As you can see, the decision boundary here is highly non-linear. So, how does an SVM classifier overcome this? The answer is known as the kernel trick.
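To make the limitation concrete, here is a minimal sketch that compares a linear SVM with a kernelized one. It assumes scikit-learn is installed and uses synthetic concentric circles from `make_circles` as a stand-in dataset; this is not the data shown in the image, and the RBF kernel and all parameter values are illustrative choices rather than anything prescribed by the text.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption for illustration): two concentric
# rings, which no straight line can separate.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear kernel restricts the SVM to a straight-line decision boundary.
linear_svm = SVC(kernel="linear").fit(X_train, y_train)

# An RBF kernel lets the SVM learn a curved decision boundary.
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

# Compare held-out accuracy of the two classifiers.
linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)
print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

On data shaped like this, the linear kernel should score noticeably lower than the RBF kernel, because no single line can put one ring on each side.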
The basic idea behind the kernel trick is that even if our data is not linearly separable...