Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Adafactor

General · Introduced 2018 · 733 papers

Source paper: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (Shazeer & Stern, 2018)

Description

Adafactor is a stochastic optimization method based on Adam that reduces memory usage while retaining the empirical benefits of adaptivity. This is achieved by maintaining a factored representation of the squared gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, the method reconstructs a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an n × m matrix, this reduces the memory requirements from O(nm) to O(n + m).
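The factored accumulator can be sketched in a few lines of numpy. This is an illustrative sketch, not the reference implementation: shapes, the decay constant, and variable names (R, C, V_hat) are assumptions, and bias correction is omitted. The key point is that only the O(n) row sums and O(m) column sums are stored, and the rank-1 reconstruction R·Cᵀ / sum(R) is the nonnegative factorization minimizing generalized KL divergence to the full accumulator.

```python
import numpy as np

# Sketch of Adafactor's factored second-moment estimate (names assumed).
# Instead of storing a full n x m accumulator V of squared gradients,
# keep only row sums R (n,) and column sums C (m,) of its moving average.

rng = np.random.default_rng(0)
n, m = 4, 3
G = rng.normal(size=(n, m))        # gradient of an n x m parameter matrix

beta2 = 0.9
R = np.zeros(n)                    # row accumulator, O(n) memory
C = np.zeros(m)                    # column accumulator, O(m) memory

# One accumulation step (bias correction omitted for brevity):
R = beta2 * R + (1 - beta2) * (G**2).sum(axis=1)  # row sums of squared grads
C = beta2 * C + (1 - beta2) * (G**2).sum(axis=0)  # column sums

# Rank-1 reconstruction V_hat = R C^T / sum(R).
V_hat = np.outer(R, C) / R.sum()

# The reconstruction reproduces the full accumulator's row and column sums.
V_full = (1 - beta2) * G**2        # full accumulator after one step from zero
assert np.allclose(V_hat.sum(axis=1), V_full.sum(axis=1))
assert np.allclose(V_hat.sum(axis=0), V_full.sum(axis=0))
```

The final assertions show why this factorization is a sensible summary: the low-rank estimate agrees with the full accumulator on every row and column marginal while storing only n + m numbers.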

Instead of defining the optimization algorithm in terms of absolute step sizes {α_t}_{t=1}^T, the authors define it in terms of relative step sizes {ρ_t}_{t=1}^T, which get multiplied by the scale of the parameters. The scale of a parameter vector or matrix is defined as the root-mean-square of its components, lower-bounded by a small constant ε₂. The reason for this lower bound is to allow zero-initialized parameters to escape 0.
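The relative-step-size rule can be sketched as follows. The function name `absolute_step` and the toy parameter vectors are assumptions for illustration; the rule itself, α_t = max(ε₂, RMS(X)) · ρ_t, is as described above.

```python
import numpy as np

def rms(x):
    """Root-mean-square of the components of a parameter array."""
    return np.sqrt(np.mean(x**2))

def absolute_step(X, rho_t, eps2=1e-3):
    # Scale is RMS(X) lower-bounded by eps2, then multiplied by the
    # relative step size rho_t to get the absolute step alpha_t.
    return max(eps2, rms(X)) * rho_t

# Zero-initialized parameters still get a nonzero step via the eps2 floor,
# so they can escape 0.
assert absolute_step(np.zeros(5), rho_t=1e-2) == 1e-3 * 1e-2

# Larger-scale parameters get proportionally larger absolute steps.
assert absolute_step(np.full(5, 10.0), rho_t=1e-2) == 10.0 * 1e-2
```

Scaling steps by parameter magnitude makes one relative learning-rate schedule usable across layers whose weights live at very different scales.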

Proposed hyperparameters are: ε₁ = 10⁻³⁰, ε₂ = 10⁻³, d = 1, ρ_t = min(10⁻², 1/√t), β̂₂ₜ = 1 − t⁻⁰·⁸.
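The two time-dependent schedules can be sketched directly from the formulas above (a minimal sketch; the function names are assumptions):

```python
import math

def rho(t):
    """Relative step size: min(1e-2, 1/sqrt(t))."""
    return min(1e-2, 1.0 / math.sqrt(t))

def beta2_hat(t):
    """Second-moment decay: 1 - t^(-0.8)."""
    return 1.0 - t ** -0.8

# Early steps are capped at 1e-2; late steps decay like 1/sqrt(t).
assert rho(1) == 1e-2
assert rho(1_000_000) == 1e-3

# The decay rate starts at 0 and increases toward 1, so the moving
# average looks back further as training progresses.
assert beta2_hat(1) == 0.0
assert beta2_hat(2) > beta2_hat(1)
```

The increasing β̂₂ₜ schedule means the accumulator behaves like a short-memory average early in training and approaches a long-memory average later, without requiring bias correction.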

Papers Using This Method

Chat-Ghosting: A Comparative Study of Methods for Auto-Completion in Dialog Systems (2025-07-08)
I Know Which LLM Wrote Your Code Last Summer: LLM generated Code Stylometry for Authorship Attribution (2025-06-18)
Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription (2025-06-17)
A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation (2025-06-09)
A Multi-Dataset Evaluation of Models for Automated Vulnerability Repair (2025-06-05)
Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking (2025-05-29)
ShIOEnv: A CLI Behavior-Capturing Environment Enabling Grammar-Guided Command Synthesis for Dataset Curation (2025-05-23)
LogiCase: Effective Test Case Generation from Logical Description in Competitive Programming (2025-05-21)
EEG-to-Text Translation: A Model for Deciphering Human Brain Activity (2025-05-20)
Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation (2025-05-16)
Multilingual Machine Translation with Quantum Encoder Decoder Attention-based Convolutional Variational Circuits (2025-05-14)
Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization (2025-05-08)
Cardioformer: Advancing AI in ECG Analysis with Multi-Granularity Patching and ResNet (2025-05-08)
GASCADE: Grouped Summarization of Adverse Drug Event for Enhanced Cancer Pharmacovigilance (2025-05-07)
A review of DNA restriction-free overlapping sequence cloning techniques for synthetic biology (2025-05-06)
JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry (2025-04-29)
Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks (2025-04-28)
Fast-Powerformer: A Memory-Efficient Transformer for Accurate Mid-Term Wind Power Forecasting (2025-04-15)
Sigma: A dataset for text-to-code semantic parsing with statistical analysis (2025-04-05)
Advancing Sentiment Analysis in Tamil-English Code-Mixed Texts: Challenges and Transformer-Based Solutions (2025-03-30)