AdaGrad Algorithm

Ryan · Dec 17, 2020


Text version of this post can be found here.


The standard gradient descent update is given by

$$\theta_{t+1} = \theta_t - \eta \, g_t, \qquad g_t = \nabla_\theta L(\theta_t),$$

where $\theta_t \in \mathbb{R}^d$ is the parameter vector, $\eta > 0$ is the learning rate, and $g_t$ is the gradient of the objective $L$ at step $t$.

A potential issue with this approach is that the single learning rate $\eta$ is shared by all features (coordinates) of the parameter vector. This is appropriate if all features are similar, in the sense that they contribute to the objective function in a similar way. Of course, this is not always true.
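To make this concrete, here is a minimal NumPy sketch of the shared-learning-rate update; the ill-conditioned quadratic `loss_grad`, the learning rate, and the step count are hypothetical choices for illustration, not from the post:

```python
import numpy as np

def loss_grad(theta):
    # Hypothetical ill-conditioned quadratic: one steep and one flat direction.
    # Gradient of 0.5 * (100 * theta[0]**2 + theta[1]**2).
    return np.array([100.0, 1.0]) * theta

theta = np.array([1.0, 1.0])
eta = 0.01  # one learning rate shared by every coordinate

for t in range(100):
    g = loss_grad(theta)
    theta = theta - eta * g  # same eta applied to the steep and the flat feature

print(theta)  # the steep coordinate converges quickly; the flat one lags behind
```

Any single choice of `eta` here is a compromise: small enough not to diverge along the steep direction, but then too small to make progress along the flat one.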

Collect the gradients observed over $T$ steps into a matrix $F = [\,g_1, \dots, g_T\,]$ with $d$ rows and $T$ columns, so that entry $F_{i,t} = g_{t,i}$ measures how sensitive the objective is to the $i$-th feature at step $t$. We can think of each row of $F$ as a time series of the sensitivity of the corresponding feature. The idea of the AdaGrad algorithm is that features with low sensitivity can use a relatively large learning rate, while features with high sensitivity should use a relatively small learning rate. This idea can be expressed by scaling each feature's step by the inverse norm of its row of $F$ seen so far:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{\sum_{s=1}^{t} g_{s,i}^2}} \, g_{t,i}.$$

The following approach is proposed in the AdaGrad paper (Duchi, Hazan, and Singer, 2011): accumulate the squared gradients of each coordinate and use them to normalize its step,

$$G_{t,i} = \sum_{s=1}^{t} g_{s,i}^2, \qquad \theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon} \, g_{t,i},$$

where $\epsilon$ is a small constant that guards against division by zero.
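Below is a minimal NumPy sketch of this update on the same hypothetical objective as above; the values of `eta` and `eps` are illustrative choices, not taken from the paper:

```python
import numpy as np

def loss_grad(theta):
    # Same hypothetical ill-conditioned quadratic as above.
    return np.array([100.0, 1.0]) * theta

theta = np.array([1.0, 1.0])
eta = 1.0                  # a single base learning rate
eps = 1e-8                 # guards against division by zero
G = np.zeros_like(theta)   # per-coordinate sum of squared gradients

for t in range(100):
    g = loss_grad(theta)
    G += g ** 2                            # accumulate squared gradients per coordinate
    theta -= eta / (np.sqrt(G) + eps) * g  # high-sensitivity coordinates get smaller steps

print(theta)  # both coordinates now make progress at comparable rates
```

Note that $G_{t,i}$ only grows over time, so each coordinate's effective learning rate decays monotonically as training proceeds.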
