Description
Adaptive softmax is a technique for speeding up the computation of probability distributions over words. It is inspired by the class-based hierarchical softmax, with the word classes built to minimize computation time. It achieves its efficiency by explicitly accounting for the cost of matrix multiplication on parallel systems and combining this with a few important observations, namely keeping a shortlist of frequent words in the root node and reducing the capacity allotted to rare words.
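The structure described above can be illustrated with a minimal, dependency-free sketch. It assumes words are indexed by decreasing frequency, so the first `cutoffs[0]` indices form the shortlist and the rest fall into tail clusters. The class name `AdaptiveSoftmaxSketch`, the `cutoffs` and `div` parameters, and the random weight initialization are hypothetical choices for illustration, not a reference implementation.

```python
import math
import random


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class AdaptiveSoftmaxSketch:
    """Two-level softmax: a head over shortlist words plus one entry per
    tail cluster, and one smaller softmax per tail cluster.
    Assumes word ids are sorted by decreasing frequency."""

    def __init__(self, hidden, cutoffs, vocab_size, div=2, seed=0):
        rng = random.Random(seed)
        self.cutoffs = list(cutoffs) + [vocab_size]
        self.shortlist = self.cutoffs[0]
        n_clusters = len(self.cutoffs) - 1
        # Root node: shortlist words and one score per tail cluster.
        self.head = rand_matrix(self.shortlist + n_clusters, hidden, rng)
        self.tails = []
        for i in range(n_clusters):
            size = self.cutoffs[i + 1] - self.cutoffs[i]
            # Reduced capacity for rarer clusters (assumed shrink factor `div`).
            d = max(1, hidden // div ** (i + 1))
            self.tails.append((rand_matrix(d, hidden, rng),
                               rand_matrix(size, d, rng)))

    def log_prob_of(self, h, word):
        """Log-probability of one word; touches at most one tail cluster."""
        head_lp = log_softmax(matvec(self.head, h))
        if word < self.shortlist:
            return head_lp[word]
        for i, (proj, out) in enumerate(self.tails):
            lo, hi = self.cutoffs[i], self.cutoffs[i + 1]
            if lo <= word < hi:
                tail_lp = log_softmax(matvec(out, matvec(proj, h)))
                # log P(word) = log P(cluster) + log P(word | cluster)
                return head_lp[self.shortlist + i] + tail_lp[word - lo]
        raise IndexError("word id out of vocabulary")

    def log_prob(self, h):
        """Full log-distribution over the vocabulary (for checking only)."""
        head_lp = log_softmax(matvec(self.head, h))
        lp = head_lp[:self.shortlist]
        for i, (proj, out) in enumerate(self.tails):
            tail_lp = log_softmax(matvec(out, matvec(proj, h)))
            lp.extend(head_lp[self.shortlist + i] + t for t in tail_lp)
        return lp
```

In this sketch, a frequent word needs only the head product, while a rare word needs the head plus one low-dimensional tail product. The saving comes from never evaluating the tail clusters that do not contain the target word.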
Papers Using This Method
RLBenchNet: The Right Network for the Right Reinforcement Learning Task (2025-05-21)
VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits (2025-05-15)
Convergence Rates for Softmax Gating Mixture of Experts (2025-03-05)
A Combined Encoder and Transformer Approach for Coherent and High-Quality Text Generation (2024-11-19)
Large Body Language Models (2024-10-21)
DenoMamba: A fused state-space model for low-dose CT denoising (2024-09-19)
Online Residual Learning from Offline Experts for Pedestrian Tracking (2024-09-06)
Transformers for Supervised Online Continual Learning (2024-03-03)
UniMem: Towards a Unified View of Long-Context Large Language Models (2024-02-05)
Memory-efficient Stochastic methods for Memory-based Transformers (2023-11-14)
TRAMS: Training-free Memory Selection for Long-range Language Modeling (2023-10-24)
Approximating Two-Layer Feedforward Networks for Efficient Transformers (2023-10-16)
Memory Gym: Towards Endless Tasks to Benchmark Memory Capabilities of Agents (2023-09-29)
Random-Access Infinite Context Length for Transformers (2023-09-21)
RCMHA: Relative Convolutional Multi-Head Attention for Natural Language Modelling (2023-08-07)
Landmark Attention: Random-Access Infinite Context Length for Transformers (2023-05-25)
Transformer-based World Models Are Happy With 100k Interactions (2023-03-13)
GTR-CTRL: Instrument and Genre Conditioning for Guitar-Focused Music Generation with Transformers (2023-02-10)
An Comparative Analysis of Different Pitch and Metrical Grid Encoding Methods in the Task of Sequential Music Generation (2023-01-31)
Efficient Sparsely Activated Transformers (2022-08-31)