
AdaMod

General · Introduced 2019 · 1 paper
Source Paper: An Adaptive and Momental Bound Method for Stochastic Learning

Description

AdaMod is a stochastic optimizer that restricts adaptive learning rates with adaptive and momental upper bounds. The dynamic learning rate bounds are based on exponential moving averages of the adaptive learning rates themselves, which smooth out unexpectedly large learning rates and stabilize the training of deep neural networks.

The weight updates are performed as:

$$g_{t} = \nabla f_{t}\left(\theta_{t-1}\right)$$

$$m_{t} = \beta_{1} m_{t-1} + \left(1-\beta_{1}\right) g_{t}$$

$$v_{t} = \beta_{2} v_{t-1} + \left(1-\beta_{2}\right) g_{t}^{2}$$

$$\hat{m}_{t} = m_{t} / \left(1 - \beta_{1}^{t}\right)$$

$$\hat{v}_{t} = v_{t} / \left(1 - \beta_{2}^{t}\right)$$

$$\eta_{t} = \alpha_{t} / \left(\sqrt{\hat{v}_{t}} + \epsilon\right)$$

$$s_{t} = \beta_{3} s_{t-1} + \left(1-\beta_{3}\right) \eta_{t}$$

$$\hat{\eta}_{t} = \min\left(\eta_{t}, s_{t}\right)$$

$$\theta_{t} = \theta_{t-1} - \hat{\eta}_{t} \hat{m}_{t}$$
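
To make the update concrete, here is a minimal NumPy sketch of a single AdaMod step following the equations above. The function name `adamod_step`, the `state` dictionary, and the hyperparameter defaults (including β₃ = 0.999) are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def adamod_step(theta, grad, state, lr=1e-3,
                beta1=0.9, beta2=0.999, beta3=0.999, eps=1e-8):
    """One AdaMod update. `state` holds m, v, s, and step counter t.

    Illustrative sketch; defaults are assumptions, not the paper's.
    """
    # Advance the step counter used for bias correction.
    state["t"] += 1
    t = state["t"]

    # Adam-style first and second moment estimates of the gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    # Per-parameter adaptive learning rate, as in Adam.
    eta = lr / (np.sqrt(v_hat) + eps)

    # Momental bound: EMA of the adaptive rates themselves,
    # then clip each rate against its own smoothed history.
    state["s"] = beta3 * state["s"] + (1 - beta3) * eta
    eta_hat = np.minimum(eta, state["s"])

    return theta - eta_hat * m_hat


# Usage on a toy quadratic f(theta) = ||theta||^2 (gradient 2*theta):
theta = np.array([5.0, -3.0])
state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta),
         "s": np.zeros_like(theta), "t": 0}
for _ in range(500):
    theta = adamod_step(theta, 2.0 * theta, state)
```

Note that when β₃ = 0, s_t equals η_t, the min is a no-op, and the update reduces to plain Adam; larger β₃ values enforce a smoother, more conservative bound on the step sizes.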

Papers Using This Method

An Adaptive and Momental Bound Method for Stochastic Learning (2019-10-27)