AMSGrad is a stochastic optimization method that seeks to fix a convergence issue with Adam-based optimizers. AMSGrad uses the maximum of past squared gradients $v_{t}$ rather than the exponential average to update the parameters:

$$m_{t} = \beta_{1}m_{t-1} + \left(1-\beta_{1}\right)g_{t}$$

$$v_{t} = \beta_{2}v_{t-1} + \left(1-\beta_{2}\right)g_{t}^{2}$$

$$\hat{v}_{t} = \max\left(\hat{v}_{t-1}, v_{t}\right)$$

$$\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{\hat{v}_{t}} + \epsilon}m_{t}$$
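The update rule above can be sketched in NumPy as follows. This is a minimal illustration of the four equations, not a production optimizer; the function name `amsgrad_step` and the hyperparameter defaults (`lr`, `beta1`, `beta2`, `eps`, chosen to mirror common Adam settings) are assumptions for the example.

```python
import numpy as np

def amsgrad_step(theta, g, m, v, v_hat,
                 lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update (hypothetical helper; follows the equations above)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate v_t
    v_hat = np.maximum(v_hat, v)             # running max of v_t: the AMSGrad change vs. Adam
    theta = theta - lr / (np.sqrt(v_hat) + eps) * m
    return theta, m, v, v_hat

# Usage sketch: minimize f(theta) = theta^2 starting from theta = 1
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
v_hat = np.zeros_like(theta)
for _ in range(5000):
    g = 2 * theta                            # gradient of theta^2
    theta, m, v, v_hat = amsgrad_step(theta, g, m, v, v_hat, lr=0.01)
```

Because $\hat{v}_{t}$ is non-decreasing, the effective per-coordinate step size can only shrink over time, which is what restores the convergence guarantee that plain Adam lacks.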