AdaDelta solves the problem of the decreasing learning rate in AdaGrad. In AdaGrad, the per-parameter learning rate is scaled by 1 divided by the square root of the sum of all past squared gradients. At each step, another squared gradient is added to the sum, which causes the denominator to grow constantly and the effective learning rate to shrink toward zero. AdaDelta, instead of summing all prior squared gradients, restricts the accumulation to a window (implemented as an exponentially decaying average), which allows the accumulated value to shrink as well as grow.
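To make the shrinking-step problem concrete, here is a minimal, illustrative AdaGrad step (names like adagrad_step, eta, and eps are assumptions for this sketch, not from a specific library): the accumulator only ever grows, so the effective step size eta / sqrt(accum) decreases monotonically.

```python
import numpy as np

def adagrad_step(param, grad, accum, eta=0.1, eps=1e-8):
    # Accumulator: running sum of squared gradients (grows without bound).
    accum = accum + grad ** 2
    # Effective step shrinks as accum grows.
    param = param - eta * grad / (np.sqrt(accum) + eps)
    return param, accum

param, accum = np.array([1.0]), np.array([0.0])
for _ in range(3):
    param, accum = adagrad_step(param, np.array([0.5]), accum)
# With a constant gradient of 0.5, the three effective steps are
# roughly 0.1, 0.071, 0.058 -- strictly decreasing.
```

Even with a constant gradient, each step is smaller than the last; after enough iterations the updates become vanishingly small.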
AdaDelta is an extension of AdaGrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to some fixed size, w.
Instead of inefficiently storing w past squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average, E[g²]_t, at time step t then depends (as a fraction, γ, similar to the momentum term) only on the previous average and the current gradient:

E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g²_t
Where E[g²]_t is the decaying average of squared gradients at time step t...