Text version of this post can be found here.
Related Reading
- Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
- Lecture 4: Optimization Landscape of Deep Learning
- Lecture 13: Data Modelling and Distributions
AdaGrad Algorithm
The standard gradient descent update is given by

$$w_{t+1} = w_t - \eta \, g_t, \qquad g_t = \nabla f(w_t),$$

where $\eta > 0$ is the learning rate.
A potential issue with this approach is that the single learning rate $\eta$ is shared by all features (coordinates) of the parameter vector. This is appropriate if all features are similar, in the sense that they contribute to the objective function in a similar way. Of course, this is not always true.
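To make the shared learning rate concrete, here is a minimal sketch of vanilla gradient descent in NumPy. The quadratic example, the function name, and all parameter values are illustrative choices, not taken from the paper:

```python
import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, num_steps=100):
    """Vanilla gradient descent: every coordinate shares the same learning rate."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        g = grad_fn(w)      # gradient of the objective at the current iterate
        w = w - lr * g      # one shared step size for all features
    return w

# Illustrative quadratic f(w) = 0.5 * w @ A @ w with very different curvature
# per feature: the shared learning rate must be small enough for the steepest
# coordinate, which makes progress on the flat coordinate slow.
A = np.diag([1.0, 100.0])
w_min = gradient_descent(lambda w: A @ w, w0=[1.0, 1.0], lr=0.01, num_steps=500)
```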
Collect the gradients from $T$ steps into a matrix $F = [\, g_1 \;\; g_2 \;\; \cdots \;\; g_T \,]$ with $d$ rows and $T$ columns. Each row of $F$ can be viewed as a time series measuring the sensitivity of the objective to the corresponding feature. The idea of the AdaGrad algorithm is that features with low sensitivity can use a relatively large learning rate, while features with high sensitivity should use a relatively small one. This idea can be expressed by the formula below:

$$\eta_i \;\propto\; \frac{1}{\|F_{i,:}\|_2} \;=\; \frac{1}{\sqrt{\sum_{t=1}^{T} g_{t,i}^2}},$$

i.e. the learning rate of coordinate $i$ is inversely proportional to the norm of the $i$-th row of $F$.
The row norms of $F$ are not known in advance, so the following approach, which uses only the gradients observed up to step $t$, is proposed in the paper:

$$w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{\sum_{s=1}^{t} g_{s,i}^2} + \epsilon}\, g_{t,i},$$

where $\epsilon > 0$ is a small constant that avoids division by zero.
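Below is a minimal sketch of this diagonal, per-coordinate update. The function name `adagrad`, the choice of `eps`, and the test problem are illustrative assumptions, not details from the paper:

```python
import numpy as np

def adagrad(grad_fn, w0, lr=1.0, num_steps=100, eps=1e-8):
    """Diagonal AdaGrad: each coordinate accumulates its own squared gradients
    and is scaled by the inverse square root of that running sum."""
    w = np.asarray(w0, dtype=float)
    sq_grad_sum = np.zeros_like(w)                       # per-coordinate sum of g_{s,i}^2
    for _ in range(num_steps):
        g = grad_fn(w)
        sq_grad_sum += g ** 2
        w = w - lr * g / (np.sqrt(sq_grad_sum) + eps)    # per-coordinate step size
    return w

# The same ill-conditioned quadratic as above: the steep coordinate automatically
# receives a smaller effective learning rate than the flat one.
A = np.diag([1.0, 100.0])
w_min = adagrad(lambda w: A @ w, w0=[1.0, 1.0], lr=1.0, num_steps=500)
```

Compared with the vanilla sketch, the only change is the running sum of squared gradients, which is exactly the quantity $\sum_{s\le t} g_{s,i}^2$ in the update above.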