Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AdaDelta

General · Introduced 2012 · 17 papers
Source Paper

Description

AdaDelta is a stochastic optimization technique that provides a per-dimension learning rate for SGD. It extends Adagrad and seeks to reduce Adagrad's aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to a fixed size $w$.

Instead of inefficiently storing $w$ previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average $E[g^2]_t$ at time step $t$ then depends only on the previous average and the current gradient:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\,g^2_t$$
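A minimal sketch of this running average in plain Python (function and variable names are illustrative; $\gamma = 0.9$ follows the typical value mentioned below):

```python
def update_running_avg(e_g2, grad, gamma=0.9):
    """E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * g_t^2."""
    return gamma * e_g2 + (1.0 - gamma) * grad ** 2

# Older gradients decay geometrically instead of being stored
# in an explicit window of size w.
e_g2 = 0.0
for g in (1.0, 0.5, 0.25):
    e_g2 = update_running_avg(e_g2, g)
```

Only a single accumulator per parameter is kept, which is what makes the decaying average cheaper than storing $w$ past gradients.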

Usually $\gamma$ is set to around $0.9$. Rewriting SGD updates in terms of the parameter update vector:

$$\Delta\theta_t = -\eta \cdot g_{t,i}$$
$$\theta_{t+1} = \theta_t + \Delta\theta_t$$

AdaDelta takes the form:

$$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$

The main advantage of AdaDelta is that we do not need to set a default learning rate.
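The update rule above can be sketched as a single step function (a hedged illustration, not Zeiler's reference implementation; the constants `eta`, `gamma`, and `eps` here are illustrative defaults, not values from the paper):

```python
import math

def adadelta_step(theta, grad, e_g2, eta=0.1, gamma=0.9, eps=1e-6):
    # Decaying average of squared gradients.
    e_g2 = gamma * e_g2 + (1.0 - gamma) * grad ** 2
    # Per-dimension step: -eta / sqrt(E[g^2]_t + eps) * g_t.
    delta = -eta * grad / math.sqrt(e_g2 + eps)
    return theta + delta, e_g2

# One step on f(theta) = theta^2 / 2, whose gradient is theta itself.
theta, e_g2 = adadelta_step(theta=1.0, grad=1.0, e_g2=0.0)
```

Note that this form still contains $\eta$; in Zeiler's full method, $\eta$ is itself replaced by a decaying RMS of past parameter updates, which is what eliminates the hand-set learning rate entirely.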

Papers Using This Method

- Accelerating Energy-Efficient Federated Learning in Cell-Free Networks with Adaptive Quantization (2024-12-30)
- New Insight in Cervical Cancer Diagnosis Using Convolution Neural Network Architecture (2024-10-23)
- A Parallelized, Adam-Based Solver for Reserve and Security Constrained AC Unit Commitment (2023-10-10)
- ELRA: Exponential learning rate adaption gradient descent optimization method (2023-09-12)
- Occupant's Behavior and Emotion Based Indoor Environment's Illumination Regulation (2023-02-19)
- BFE and AdaBFE: A New Approach in Learning Rate Automation for Stochastic Optimization (2022-07-06)
- Gradient Descent, Stochastic Optimization, and Other Tales (2022-05-02)
- AdaSmooth: An Adaptive Learning Rate Method based on Effective Ratio (2022-04-02)
- Adaptively Customizing Activation Functions for Various Layers (2021-12-17)
- Tom: Leveraging trend of the observed gradients for faster convergence (2021-09-07)
- Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks (2021-03-26)
- An Adaptive and Momental Bound Method for Stochastic Learning (2019-10-27)
- diffGrad: An Optimization Method for Convolutional Neural Networks (2019-09-12)
- On the Convergence of Adam and Beyond (2019-04-19)
- Adaptive Methods for Nonconvex Optimization (2018-12-01)
- Online Batch Selection for Faster Training of Neural Networks (2015-11-19)
- ADADELTA: An Adaptive Learning Rate Method (2012-12-22)