The L2 penalty, also known as ridge regression, is similar in many ways to the L1 penalty, but instead of penalizing the sum of the absolute weights, it penalizes the sum of the squared weights. The penalty therefore grows quadratically, with larger weights (positive or negative) incurring a disproportionately greater penalty. In the context of neural networks, this is sometimes referred to as weight decay: if you examine the gradient of the regularized objective function, the penalty term contributes a gradient proportional to the weights themselves, so every gradient-descent update shrinks the weights by a constant multiplicative factor before applying the data-driven step. As with the L1 penalty, biases or offsets are usually excluded from the penalty, although they could be included.
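A minimal sketch of this weight-decay interpretation, assuming plain gradient descent on a squared-error loss; the data, learning rate eta, and penalty strength lam below are illustrative choices, not values from the text:

```python
import numpy as np

# Illustrative data: a small linear regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

w = np.zeros(3)        # weights (a bias term would be excluded from the penalty)
eta, lam = 0.01, 0.1   # learning rate and L2 penalty strength (assumed values)

for _ in range(1000):
    grad = -2 * X.T @ (y - X @ w) / len(y)  # gradient of the squared-error loss
    # The penalty 0.5 * lam * w @ w contributes a gradient of lam * w, so the
    # update multiplies the weights by (1 - eta * lam) before the data-driven
    # step -- hence the name "weight decay".
    w = (1 - eta * lam) * w - eta * grad

print(w)
```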
From the perspective of a linear regression problem, the L2 penalty modifies the objective function minimized from $(Y - X\beta)^T(Y - X\beta)$ to $(Y - X\beta)^T(Y - X\beta) + 0.5\lambda\beta^T\beta$. As with the L1 penalty, the L2 penalty can allow otherwise underdetermined problems to be solved, particularly when the covariance matrix $X^T X$ is singular or nearly so: the ridge solution requires inverting $X^T X + 0.5\lambda I$, which is invertible for any $\lambda > 0$.
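A short sketch of this closed-form solution, following the text's $0.5\lambda$ convention; the data are made up, with a deliberately duplicated column so that $X^T X$ is singular and ordinary least squares has no unique solution:

```python
import numpy as np

# Made-up data with two identical predictor columns, so X.T @ X is singular.
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1, rng.normal(size=50)])
Y = 2 * x1 + rng.normal(scale=0.1, size=50)

# Minimizing (Y - X b)^T (Y - X b) + 0.5 * lam * b^T b yields
# b = (X^T X + 0.5 * lam * I)^{-1} X^T Y, which is well-defined
# even though X^T X itself is singular.
lam = 1.0
p = X.shape[1]
b_ridge = np.linalg.solve(X.T @ X + 0.5 * lam * np.eye(p), X.T @ Y)
print(b_ridge)

# Plain least squares would need to invert the singular X.T @ X here,
# so np.linalg.solve(X.T @ X, X.T @ Y) has no unique answer.
```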