Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AdaGrad

General · Introduced 2011 · 192 papers

Description

AdaGrad is a stochastic optimization method that adapts the learning rate to the parameters. It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequently occurring features. In its update rule, AdaGrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_{i}$ based on the past gradients for $\theta_{i}$:

$$\theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{G_{t, ii} + \epsilon}} \, g_{t, i}$$

where $g_{t, i}$ is the gradient with respect to $\theta_{i}$ at time step $t$, $G_{t, ii}$ is the accumulated sum of the squares of those gradients up to time step $t$, and $\epsilon$ is a small smoothing term that avoids division by zero.

The benefit of AdaGrad is that it eliminates the need to manually tune the learning rate; most implementations leave it at the default value of 0.01. Its main weakness is the accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training, causing the learning rate to shrink until it becomes infinitesimally small.
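The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production optimizer; the function name `adagrad_step`, the toy objective $f(\theta) = \theta^2$, and the chosen hyperparameters are assumptions for the example.

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.01, eps=1e-8):
    """One AdaGrad update (toy sketch).

    G accumulates the element-wise squared gradients (the diagonal
    G_{t,ii}); the effective learning rate for each parameter is
    lr / sqrt(G + eps), so frequently-updated parameters get smaller steps.
    """
    G = G + grad ** 2
    theta = theta - lr / np.sqrt(G + eps) * grad
    return theta, G

# Minimize f(theta) = theta^2 starting from theta = 5.
theta = np.array([5.0])
G = np.zeros_like(theta)
for _ in range(500):
    grad = 2.0 * theta          # gradient of theta^2
    theta, G = adagrad_step(theta, grad, G, lr=0.5)
```

Note that `G` only ever grows, which is exactly the weakness described above: on long runs the effective step size `lr / sqrt(G + eps)` decays toward zero even if the iterate is still far from a minimum.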

Image: Alec Radford

Papers Using This Method

- Recursive Bound-Constrained AdaGrad with Applications to Multilevel and Domain Decomposition Minimization (2025-07-15)
- LightSAM: Parameter-Agnostic Sharpness-Aware Minimization (2025-05-30)
- Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation (2025-05-20)
- Complexity Lower Bounds of Adaptive Gradient Algorithms for Non-convex Stochastic Optimization under Relaxed Smoothness (2025-05-07)
- Structured Preconditioners in Adaptive Optimization: A Unified Analysis (2025-03-13)
- Tractable Representations for Convergent Approximation of Distributional HJB Equations (2025-03-07)
- Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization (2025-02-17)
- Integrating LLMs with ITS: Recent Advances, Potentials, Challenges, and Future Directions (2025-01-08)
- Towards Simple and Provable Parameter-Free Adaptive Gradient Methods (2024-12-27)
- Adaptive Optimization for Enhanced Efficiency in Large-Scale Language Model Training (2024-12-06)
- A Combined Encoder and Transformer Approach for Coherent and High-Quality Text Generation (2024-11-19)
- Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equations (2024-11-14)
- New Insight in Cervical Cancer Diagnosis Using Convolution Neural Network Architecture (2024-10-23)
- Preconditioning for Accelerated Gradient Descent Optimization and Regularization (2024-09-30)
- Stability and convergence analysis of AdaGrad for non-convex optimization via novel stopping time-based techniques (2024-09-08)
- Causal Temporal Representation Learning with Nonstationary Sparse Transition (2024-09-05)
- Machine learning models for daily rainfall forecasting in Northern Tropical Africa using tropical wave predictors (2024-08-29)
- A Methodology Establishing Linear Convergence of Adaptive Gradient Methods under PL Inequality (2024-07-17)
- AdaGrad under Anisotropic Smoothness (2024-06-21)
- Provable Complexity Improvement of AdaGrad over SGD: Upper and Lower Bounds in Stochastic Non-Convex Optimization (2024-06-07)