Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Nesterov Accelerated Gradient

General · Introduced 1983 · 34 papers

Description

Nesterov Accelerated Gradient is a momentum-based SGD optimizer that "looks ahead" to where the parameters will be and evaluates the gradient at that future position rather than the current one:

$$v_{t} = \gamma v_{t-1} - \eta \nabla_{\theta} J\left(\theta_{t-1} + \gamma v_{t-1}\right)$$

$$\theta_{t} = \theta_{t-1} + v_{t}$$

where $\gamma, \eta \in \mathbb{R}^{+}$.

As with SGD with momentum, $\gamma$ is usually set to $0.9$; both $\eta$ and $\gamma$ are typically less than $1$.

The intuition is that the standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient. In contrast, Nesterov momentum first makes a big jump in the direction of the previous accumulated gradient, then measures the gradient where it ends up and makes a correction. The idea is that it is better to correct a mistake after you have made it.
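The contrast between the two update rules can be sketched in a few lines of pure Python. This is a minimal illustration, not a production optimizer; the quadratic objective, step count, and hyperparameter values are illustrative choices, not taken from this page:

```python
def momentum_step(theta, v, grad_fn, eta, gamma):
    # Classical momentum: measure the gradient at the current point,
    # then take the jump.
    v = gamma * v - eta * grad_fn(theta)
    return theta + v, v

def nesterov_step(theta, v, grad_fn, eta, gamma):
    # Nesterov: jump along the previous velocity first, then correct
    # using the gradient measured at the look-ahead point.
    v = gamma * v - eta * grad_fn(theta + gamma * v)
    return theta + v, v

grad = lambda x: 2.0 * x  # gradient of f(x) = x**2, minimized at x = 0

results = {}
for step_fn in (momentum_step, nesterov_step):
    theta, v = 5.0, 0.0  # initial parameter and velocity
    for _ in range(100):
        theta, v = step_fn(theta, v, grad, eta=0.05, gamma=0.9)
    results[step_fn.__name__] = theta
print(results)
```

Note that the only difference between the two functions is where the gradient is evaluated: classical momentum uses `theta`, Nesterov uses the look-ahead point `theta + gamma * v`. In practice, frameworks expose this as a flag on their SGD optimizer (e.g. `nesterov=True` in `torch.optim.SGD`).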

Image Source: Geoff Hinton lecture notes

Papers Using This Method

- Convergence of Momentum-Based Optimization Algorithms with Time-Varying Parameters (2025-06-13)
- Nesterov Method for Asynchronous Pipeline Parallel Optimization (2025-05-02)
- Advancing RVFL networks: Robust classification with the HawkEye loss function (2024-10-01)
- An Accelerated Algorithm for Stochastic Bilevel Optimization under Unbounded Smoothness (2024-09-28)
- Optimizing Time Series Forecasting: A Comparative Study of Adam and Nesterov Accelerated Gradient on LSTM and GRU networks Using Stock Market data (2024-09-28)
- DenoMamba: A fused state-space model for low-dose CT denoising (2024-09-19)
- 3D CBCT Challenge 2024: Improved Cone Beam CT Reconstruction using SwinIR-Based Sinogram and Image Enhancement (2024-06-12)
- Momentum-SAM: Sharpness Aware Minimization without Computational Overhead (2024-01-22)
- Accelerated gradient methods for nonconvex optimization: Escape trajectories from strict saddle points and convergence to local minima (2023-07-13)
- Riemannian accelerated gradient methods via extrapolation (2022-08-13)
- Last-iterate convergence analysis of stochastic momentum methods for neural networks (2022-05-30)
- Automated Parking Space Detection Using Convolutional Neural Networks (2021-06-14)
- A Discrete Variational Derivation of Accelerated Methods in Optimization (2021-06-04)
- A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes (2021-02-12)
- Stochastic optimization with momentum: convergence, fluctuations, and traps avoidance (2020-12-07)
- A Dynamical View on Optimization Algorithms of Overparameterized Neural Networks (2020-10-25)
- Accelerated Gradient Methods for Sparse Statistical Learning with Nonconvex Penalties (2020-09-22)
- Federated Learning with Nesterov Accelerated Gradient (2020-09-18)
- GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet (2020-03-25)
- Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent (2020-02-24)