Papers With Code 2


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AdamW

General · Introduced 2017 · 206 papers
Source Paper: Decoupled Weight Decay Regularization (Loshchilov & Hutter, 2017)

Description

AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam by decoupling the weight decay from the gradient update. In Adam, $L_{2}$ regularization is usually implemented with the following modification to the gradient, where $w_{t}$ is the weight decay rate at time $t$:

$$g_{t} = \nabla f\left(\theta_{t}\right) + w_{t}\theta_{t}$$

while AdamW instead moves the weight decay term out of the gradient and applies it directly in the parameter update (here $\hat{m}_{t}$ and $\hat{v}_{t}$ are Adam's bias-corrected first and second moment estimates and $\eta$ is the learning rate):

$$\theta_{t+1, i} = \theta_{t, i} - \eta\left(\frac{1}{\sqrt{\hat{v}_{t}} + \epsilon}\cdot\hat{m}_{t} + w_{t, i}\,\theta_{t, i}\right), \quad \forall t$$
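The update above can be sketched as a plain-Python step function (an illustrative sketch, not the reference implementation; the function name and its default hyperparameters are assumptions). The key point is that the `weight_decay * p` term is added inside the parameter update, not to the gradient, so it is never rescaled by the adaptive $1/(\sqrt{\hat{v}_{t}} + \epsilon)$ factor:

```python
import math

def adamw_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update over parameters given as flat lists (hypothetical sketch).

    theta, m, v, grad: lists of parameters, moment estimates, and gradients.
    t: 1-indexed step count, used for bias correction.
    """
    new_theta, new_m, new_v = [], [], []
    for p, mi, vi, g in zip(theta, m, v, grad):
        # Adam moment updates; note g is the raw gradient, with NO w*p added.
        mi = beta1 * mi + (1 - beta1) * g
        vi = beta2 * vi + (1 - beta2) * g * g
        m_hat = mi / (1 - beta1 ** t)   # bias-corrected first moment
        v_hat = vi / (1 - beta2 ** t)   # bias-corrected second moment
        # Decoupled weight decay: applied directly to p, outside the
        # adaptive 1/(sqrt(v_hat) + eps) rescaling.
        p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
        new_theta.append(p)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v
```

With a zero gradient the parameter still shrinks by `lr * weight_decay * p` per step, which is exactly the behavior that distinguishes decoupled decay from $L_{2}$ regularization folded into $g_{t}$.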

Papers Using This Method

- I Know Which LLM Wrote Your Code Last Summer: LLM generated Code Stylometry for Authorship Attribution (2025-06-18)
- Improving LoRA with Variational Learning (2025-06-17)
- PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective (2025-05-27)
- AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training (2025-05-22)
- Enhancing Abstractive Summarization of Scientific Papers Using Structure Information (2025-05-20)
- A Physics-Inspired Optimizer: Velocity Regularized Adam (2025-05-19)
- Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training (2025-05-19)
- On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm (2025-05-17)
- Variational Visual Question Answering (2025-05-14)
- Practical Efficiency of Muon for Pretraining (2025-05-04)
- CacheFormer: High Attention-Based Segment Caching (2025-04-18)
- Learning from Streaming Video with Orthogonal Gradients (2025-04-02)
- Chirp Localization via Fine-Tuned Transformer Model: A Proof-of-Concept Study (2025-03-24)
- ARLED: Leveraging LED-based ARMAN Model for Abstractive Summarization of Persian Long Documents (2025-03-13)
- Fine-Tuning Florence2 for Enhanced Object Detection in Un-constructed Environments: Vision-Language Model Approach (2025-03-06)
- The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training (2025-02-26)
- Muon is Scalable for LLM Training (2025-02-24)
- COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs (2025-02-24)
- Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs (2025-02-21)
- A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models (2025-02-20)