
AdaMod

General · Introduced 2019 · 1 paper
Source Paper: An Adaptive and Momental Bound Method for Stochastic Learning

Description

AdaMod is a stochastic optimizer that restricts adaptive learning rates with adaptive and momental upper bounds. The dynamic learning rate bounds are based on exponential moving averages of the adaptive learning rates themselves, which smooth out unexpectedly large learning rates and stabilize the training of deep neural networks.

The weight updates are performed as:

$$g_{t} = \nabla f_{t}\left(\theta_{t-1}\right)$$

$$m_{t} = \beta_{1} m_{t-1} + \left(1-\beta_{1}\right) g_{t}$$

$$v_{t} = \beta_{2} v_{t-1} + \left(1-\beta_{2}\right) g_{t}^{2}$$

$$\hat{m}_{t} = m_{t} / \left(1 - \beta_{1}^{t}\right)$$

$$\hat{v}_{t} = v_{t} / \left(1 - \beta_{2}^{t}\right)$$

$$\eta_{t} = \alpha_{t} / \left(\sqrt{\hat{v}_{t}} + \epsilon\right)$$

$$s_{t} = \beta_{3} s_{t-1} + \left(1-\beta_{3}\right) \eta_{t}$$

$$\hat{\eta}_{t} = \min\left(\eta_{t}, s_{t}\right)$$

$$\theta_{t} = \theta_{t-1} - \hat{\eta}_{t} \hat{m}_{t}$$
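
To make the update concrete, here is a minimal NumPy sketch of a single AdaMod step following the equations above. The function name `adamod_step`, the `state` dictionary, and the hyperparameter defaults (including β₃ = 0.999) are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def adamod_step(theta, grad, state, lr=1e-3,
                beta1=0.9, beta2=0.999, beta3=0.999, eps=1e-8):
    """One AdaMod update. `state` holds m, v, s, and step counter t.

    Illustrative sketch; defaults are assumptions, not the paper's.
    """
    # Advance the step counter used for bias correction.
    state["t"] += 1
    t = state["t"]

    # Adam-style first and second moment estimates of the gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    # Per-parameter adaptive learning rate, as in Adam.
    eta = lr / (np.sqrt(v_hat) + eps)

    # Momental bound: EMA of the adaptive rates themselves,
    # then clip each rate against its own smoothed history.
    state["s"] = beta3 * state["s"] + (1 - beta3) * eta
    eta_hat = np.minimum(eta, state["s"])

    return theta - eta_hat * m_hat


# Usage on a toy quadratic f(theta) = ||theta||^2 (gradient 2*theta):
theta = np.array([5.0, -3.0])
state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta),
         "s": np.zeros_like(theta), "t": 0}
for _ in range(500):
    theta = adamod_step(theta, 2.0 * theta, state)
```

Note that when β₃ = 0, s_t equals η_t, the min is a no-op, and the update reduces to plain Adam; larger β₃ values enforce a smoother, more conservative bound on the step sizes.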

Papers Using This Method

An Adaptive and Momental Bound Method for Stochastic Learning (2019-10-27)