Papers With Code 2


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AdamW

General · Introduced 2017 · 206 papers
Source Paper: Decoupled Weight Decay Regularization (Loshchilov & Hutter, 2017)

Description

AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam by decoupling the weight decay from the gradient update. In Adam, $L_{2}$ regularization is usually implemented with the following modification to the gradient, where $w_{t}$ is the weight decay rate at time $t$:

$$g_{t} = \nabla f\left(\theta_{t}\right) + w_{t}\theta_{t}$$

while AdamW instead moves the weight decay term out of the gradient and applies it directly in the parameter update (here $\hat{m}_{t}$ and $\hat{v}_{t}$ are Adam's bias-corrected first and second moment estimates and $\eta$ is the learning rate):

$$\theta_{t+1, i} = \theta_{t, i} - \eta\left(\frac{1}{\sqrt{\hat{v}_{t}} + \epsilon}\cdot\hat{m}_{t} + w_{t, i}\,\theta_{t, i}\right), \quad \forall t$$
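The update above can be sketched as a plain-Python step function (an illustrative sketch, not the reference implementation; the function name and its default hyperparameters are assumptions). The key point is that the `weight_decay * p` term is added inside the parameter update, not to the gradient, so it is never rescaled by the adaptive $1/(\sqrt{\hat{v}_{t}} + \epsilon)$ factor:

```python
import math

def adamw_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update over parameters given as flat lists (hypothetical sketch).

    theta, m, v, grad: lists of parameters, moment estimates, and gradients.
    t: 1-indexed step count, used for bias correction.
    """
    new_theta, new_m, new_v = [], [], []
    for p, mi, vi, g in zip(theta, m, v, grad):
        # Adam moment updates; note g is the raw gradient, with NO w*p added.
        mi = beta1 * mi + (1 - beta1) * g
        vi = beta2 * vi + (1 - beta2) * g * g
        m_hat = mi / (1 - beta1 ** t)   # bias-corrected first moment
        v_hat = vi / (1 - beta2 ** t)   # bias-corrected second moment
        # Decoupled weight decay: applied directly to p, outside the
        # adaptive 1/(sqrt(v_hat) + eps) rescaling.
        p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
        new_theta.append(p)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v
```

With a zero gradient the parameter still shrinks by `lr * weight_decay * p` per step, which is exactly the behavior that distinguishes decoupled decay from $L_{2}$ regularization folded into $g_{t}$.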

Papers Using This Method

- I Know Which LLM Wrote Your Code Last Summer: LLM generated Code Stylometry for Authorship Attribution (2025-06-18)
- Improving LoRA with Variational Learning (2025-06-17)
- PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective (2025-05-27)
- AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training (2025-05-22)
- Enhancing Abstractive Summarization of Scientific Papers Using Structure Information (2025-05-20)
- A Physics-Inspired Optimizer: Velocity Regularized Adam (2025-05-19)
- Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training (2025-05-19)
- On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm (2025-05-17)
- Variational Visual Question Answering (2025-05-14)
- Practical Efficiency of Muon for Pretraining (2025-05-04)
- CacheFormer: High Attention-Based Segment Caching (2025-04-18)
- Learning from Streaming Video with Orthogonal Gradients (2025-04-02)
- Chirp Localization via Fine-Tuned Transformer Model: A Proof-of-Concept Study (2025-03-24)
- ARLED: Leveraging LED-based ARMAN Model for Abstractive Summarization of Persian Long Documents (2025-03-13)
- Fine-Tuning Florence2 for Enhanced Object Detection in Un-constructed Environments: Vision-Language Model Approach (2025-03-06)
- The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training (2025-02-26)
- Muon is Scalable for LLM Training (2025-02-24)
- COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs (2025-02-24)
- Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs (2025-02-21)
- A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models (2025-02-20)