Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MARS: Unleashing the Power of Variance Reduction for Training Large Models

Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, Quanquan Gu

2024-11-15 · Stochastic Optimization
Paper · PDF · Code (official)

Abstract

Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin. The implementation of MARS is available at https://github.com/AGI-Arena/MARS.
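The abstract describes combining a scaled stochastic recursive-momentum (variance-reduced) gradient estimate with a preconditioned update such as AdamW's. The sketch below illustrates that general idea in NumPy; the function name, hyperparameter defaults, and clipping details are assumptions made for exposition, not the official implementation (see the linked repository for that).

```python
import numpy as np

def mars_adamw_sketch(grad_fn, x0, steps=100, lr=3e-3, beta1=0.95, beta2=0.99,
                      gamma=0.025, weight_decay=0.0, eps=1e-8):
    """Illustrative sketch: a scaled recursive-momentum (variance-reduced)
    gradient estimate fed into an AdamW-style preconditioned update.

    `grad_fn(x, seed)` must return a stochastic gradient of the loss at x for
    the mini-batch identified by `seed`, so the same batch can be re-evaluated
    at the previous iterate (the key ingredient of variance reduction).
    """
    x = x0.astype(float).copy()
    x_prev = x.copy()
    m = np.zeros_like(x)  # first moment of the corrected gradient
    v = np.zeros_like(x)  # second moment of the corrected gradient
    for t in range(1, steps + 1):
        seed = t                     # stands in for sampling a fresh mini-batch
        g = grad_fn(x, seed)
        g_prev = grad_fn(x_prev, seed)  # same batch, previous iterate
        # Variance-reduced estimate: gradient plus a scaled momentum correction.
        c = g + gamma * (beta1 / (1.0 - beta1)) * (g - g_prev)
        # Clip the corrected gradient to unit norm so the correction stays bounded
        # (an assumption of this sketch).
        norm = np.linalg.norm(c)
        if norm > 1.0:
            c = c / norm
        m = beta1 * m + (1.0 - beta1) * c
        v = beta2 * v + (1.0 - beta2) * c * c
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        x_prev = x.copy()
        # AdamW-style preconditioned step with decoupled weight decay.
        x -= lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * x)
    return x
```

As a toy usage example, minimizing a noisy quadratic with a hypothetical stochastic gradient oracle:

```python
target = np.ones(10)

def noisy_quad_grad(x, seed):
    noise = np.random.default_rng(seed).normal(scale=0.1, size=x.shape)
    return (x - target) + noise

x_star = mars_adamw_sketch(noisy_quad_grad, np.zeros(10), steps=500)
```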

Related Papers

First-order methods for stochastic and finite-sum convex optimization with deterministic constraints (2025-06-25)
Convergence of Momentum-Based Optimization Algorithms with Time-Varying Parameters (2025-06-13)
Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery (2025-06-12)
The Sample Complexity of Parameter-Free Stochastic Convex Optimization (2025-06-12)
"What are my options?": Explaining RL Agents with Diverse Near-Optimal Alternatives (Extended) (2025-06-11)
PADAM: Parallel averaged Adam reduces the error for stochastic optimization in scientific machine learning (2025-05-28)
Online distributed optimization for spatio-temporally constrained real-time peer-to-peer energy trading (2025-05-28)
Distribution free M-estimation (2025-05-28)