Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AMSGrad

General · Introduced 2018 · 49 papers
Source Paper

Description

AMSGrad is a stochastic optimization method that seeks to fix a convergence issue with Adam-based optimizers. AMSGrad uses the maximum of past squared gradients $v_t$ rather than the exponential average to update the parameters:

$$m_t = \beta_1 m_{t-1} + \left(1 - \beta_1\right) g_t$$

$$v_t = \beta_2 v_{t-1} + \left(1 - \beta_2\right) g_t^2$$

$$\hat{v}_t = \max\left(\hat{v}_{t-1}, v_t\right)$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} m_t$$
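The update rule above can be sketched in plain NumPy. This is a minimal illustration of the four equations, not the canonical implementation; the function name and default hyperparameters (borrowed from Adam's common settings) are assumptions for the example:

```python
import numpy as np

def amsgrad_update(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step; `state` is the tuple (m, v, v_hat)."""
    m, v, v_hat = state
    m = beta1 * m + (1 - beta1) * grad        # exponential average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # exponential average of squared gradients
    v_hat = np.maximum(v_hat, v)              # key AMSGrad step: running max of past v_t
    theta = theta - lr * m / (np.sqrt(v_hat) + eps)
    return theta, (m, v, v_hat)

# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
theta = np.array([5.0])
state = (np.zeros(1), np.zeros(1), np.zeros(1))
for _ in range(5000):
    theta, state = amsgrad_update(theta, 2 * theta, state)
```

Because $\hat{v}_t$ is non-decreasing, the effective per-coordinate step size can only shrink over time, which is what restores the convergence guarantee that plain Adam lacks.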

Papers Using This Method

- Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation (2024-10-14)
- MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence (2024-05-24)
- MADA: Meta-Adaptive Optimizers through hyper-gradient Descent (2024-01-17)
- FedLALR: Client-Specific Adaptive Learning Rates Achieve Linear Speedup for Non-IID Data (2023-09-18)
- Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods (2023-05-21)
- UAdam: Unified Adam-Type Algorithmic Framework for Non-Convex Stochastic Optimization (2023-05-09)
- $\mathcal{C}^k$-continuous Spline Approximation with TensorFlow Gradient Descent Optimizers (2023-03-22)
- AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks (2023-03-01)
- Optimization Methods in Deep Learning: A Comprehensive Overview (2023-02-19)
- Langevin dynamics based algorithm e-TH$\varepsilon$O POULA for stochastic optimization problems with discontinuous stochastic gradient (2022-10-24)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining (2022-10-14)
- Nest Your Adaptive Algorithm for Parameter-Agnostic Nonconvex Minimax Optimization (2022-06-01)
- On Distributed Adaptive Optimization with Gradient Compression (2022-05-11)
- AdaTerm: Adaptive T-Distribution Estimated Robust Moments for Noise-Robust Stochastic Gradient Optimization (2022-01-18)
- Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates (2022-01-05)
- A Novel Convergence Analysis for Algorithms of the Adam Family (2021-12-07)
- Convergence of adaptive algorithms for constrained weakly convex optimization (2021-12-01)
- Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization (2021-11-01)
- SGD Can Converge to Local Maxima (2021-09-29)
- On the Convergence of Decentralized Adaptive Gradient Methods (2021-09-07)