Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SGD

Stochastic Gradient Descent

General · Introduced 1951 · 2021 papers

Description

Stochastic Gradient Descent is an iterative optimization technique that uses minibatches of data to form an unbiased estimate of the gradient, rather than computing the full gradient over all available data. That is, for weights $w$ and a loss function $L$, the update is:

$$w_{t+1} = w_{t} - \eta\,\hat{\nabla}_{w} L(w_{t})$$

where $\eta$ is the learning rate. SGD reduces the redundancy of batch gradient descent, which recomputes gradients over many similar examples before each parameter update, so it is usually much faster.
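The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration on a synthetic least-squares problem, not any particular paper's method; the data, the quadratic loss, and all hyperparameters here are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + noise
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def minibatch_grad(w, Xb, yb):
    """Gradient of the mean squared error on one minibatch:
    an unbiased estimate of the full-data gradient."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(d)
eta = 0.05        # learning rate (eta in the update rule)
batch_size = 32

for epoch in range(20):
    # Shuffle each epoch so every minibatch is a random sample
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        # SGD step: w_{t+1} = w_t - eta * grad_hat
        w -= eta * minibatch_grad(w, X[idx], y[idx])

err = np.linalg.norm(w - w_true)
print(err)  # distance to the true weights; should be small
```

Each step touches only `batch_size` examples, which is exactly the redundancy saving over batch gradient descent described above: for a dataset with many similar examples, the minibatch estimate is nearly as informative as the full gradient at a fraction of the cost.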


Papers Using This Method

- Fast Last-Iterate Convergence of SGD in the Smooth Interpolation Regime (2025-07-15)
- A Single Merging Suffices: Recovering Server-based Learning Performance in Decentralized Learning (2025-07-09)
- Tight Generalization Error Bounds for Stochastic Gradient Descent in Non-convex Learning (2025-06-23)
- A Minimalist Optimizer Design for LLM Pretraining (2025-06-20)
- A Simplified Analysis of SGD for Linear Regression with Weight Averaging (2025-06-18)
- Sharpness-Aware Machine Unlearning (2025-06-16)
- Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling (2025-06-14)
- Learning single-index models via harmonic decomposition (2025-06-11)
- An Adaptive Method Stabilizing Activations for Enhanced Generalization (2025-06-10)
- Improved Scaling Laws in Linear Regression via Data Reuse (2025-06-10)
- Online Learning-guided Learning Rate Adaptation via Gradient Alignment (2025-06-10)
- Orthogonal Gradient Descent Improves Neural Calibration (2025-06-04)
- Classifying Dental Care Providers Through Machine Learning with Features Ranking (2025-06-04)
- Replay Can Provably Increase Forgetting (2025-06-04)
- Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order (2025-06-04)
- Incremental Gradient Descent with Small Epoch Counts is Surprisingly Slow on Ill-Conditioned Problems (2025-06-04)
- Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks (2025-06-03)
- LightSAM: Parameter-Agnostic Sharpness-Aware Minimization (2025-05-30)
- SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training (2025-05-29)
- The Rich and the Simple: On the Implicit Bias of Adam and SGD (2025-05-29)