An Adaptive and Momental Bound Method for Stochastic Learning

Jianbang Ding, Xuancheng Ren, Ruixuan Luo, Xu Sun

2019-10-27 | Stochastic Optimization

Abstract

Training deep neural networks requires intricate initialization and careful selection of learning rates. The emergence of stochastic gradient optimization methods that use adaptive learning rates based on squared past gradients, e.g., AdaGrad, AdaDelta, and Adam, eases the job slightly. However, recent studies have shown that such methods have pitfalls of their own, including non-convergence issues. Variants such as AMSGrad, AdaShift, and AdaBound have been proposed to address them. In this work, we identify a new problem of adaptive learning rate methods that appears at the beginning of training: Adam produces extremely large learning rates that inhibit the start of learning. We propose the Adaptive and Momental Bound (AdaMod) method, which restricts the adaptive learning rates with adaptive and momental upper bounds. The dynamic learning rate bounds are based on the exponential moving averages of the adaptive learning rates themselves, which smooth out unexpectedly large learning rates and stabilize the training of deep neural networks. Our experiments verify that AdaMod eliminates the extremely large learning rates throughout training and brings significant improvements over Adam, especially on complex networks such as DenseNet and Transformer. Our implementation is available at: https://github.com/lancopku/AdaMod
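
The bounding idea described in the abstract can be illustrated with a minimal scalar sketch. This is not the authors' code (the repository linked above is the reference implementation); it only assumes an Adam-style update in which the per-parameter learning rate is clipped by an exponential moving average of its own past values. The names `adamod_step`, `new_state`, and `beta3`, the default hyperparameter values, and the zero initialization of the running average `s` are assumptions made for illustration.

```python
import math


def new_state():
    """Fresh optimizer state for one scalar parameter (hypothetical helper)."""
    return {"t": 0, "m": 0.0, "v": 0.0, "s": 0.0}


def adamod_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                beta3=0.999, eps=1e-8):
    """One Adam-style step with a momental upper bound on the learning rate.

    Returns (new_param, unbounded_rate, bounded_rate).
    """
    state["t"] += 1
    t = state["t"]

    # Standard Adam moment estimates with bias correction.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    # Adaptive learning rate as Adam would use it.
    eta = lr / (math.sqrt(v_hat) + eps)

    # Exponential moving average of the adaptive learning rates themselves
    # (assumed here to start from zero).
    state["s"] = beta3 * state["s"] + (1 - beta3) * eta

    # The moving average acts as a dynamic upper bound on the rate.
    eta_bounded = min(eta, state["s"])

    return param - eta_bounded * m_hat, eta, eta_bounded


if __name__ == "__main__":
    # Toy problem: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
    x, state = 0.0, new_state()
    for _ in range(5):
        x, eta, eta_bounded = adamod_step(x, 2 * (x - 3), state)
        print(f"x={x:.8f}  adam_rate={eta:.3e}  bounded_rate={eta_bounded:.3e}")
```

In this sketch, a very small early gradient makes the unbounded rate `eta` very large, while `min(eta, s)` keeps the applied rate at or below the smoothed average, which is the stabilizing effect the abstract attributes to the momental bound.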
