Adaptive Softmax

General | Introduced 2000 | 72 papers

Description

Adaptive Softmax is a technique for speeding up the computation of probability distributions over words in large vocabularies. It is inspired by the class-based hierarchical softmax, with the word classes constructed to minimize computation time. It achieves this efficiency by explicitly accounting for the cost of matrix multiplication on parallel hardware and combining that with two observations: a shortlist of frequent words can be kept in the root node, and the capacity of the classifiers for rare words can be reduced.
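To make the structure concrete, here is a minimal sketch of the forward pass in pure Python. It assumes word ids are sorted by decreasing frequency, uses random weights in place of trained ones, and shrinks the projection of each successive tail cluster by a fixed factor. The class name `TinyAdaptiveSoftmax` and the parameters `cutoffs` and `div_value` are illustrative choices for this sketch, not something defined on this page.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyAdaptiveSoftmax:
    """Two-level sketch: a head (root node) plus reduced-capacity tail clusters."""

    def __init__(self, hidden_dim, vocab_size, cutoffs, div_value=4, seed=0):
        rng = random.Random(seed)
        # Cluster boundaries over word ids sorted by decreasing frequency.
        self.cutoffs = list(cutoffs) + [vocab_size]
        self.shortlist = self.cutoffs[0]
        n_clusters = len(self.cutoffs) - 1
        # Root node: one logit per shortlist word plus one per tail cluster.
        self.head = rand_matrix(self.shortlist + n_clusters, hidden_dim, rng)
        self.tails = []
        for i in range(n_clusters):
            size = self.cutoffs[i + 1] - self.cutoffs[i]
            # Rarer clusters get a smaller projection, i.e. less capacity.
            proj_dim = max(1, hidden_dim // div_value ** (i + 1))
            proj = rand_matrix(proj_dim, hidden_dim, rng)
            out = rand_matrix(size, proj_dim, rng)
            self.tails.append((proj, out))

    def log_prob(self, hidden, word):
        """Log-probability of a single word id given a hidden vector."""
        head_lp = log_softmax(matvec(self.head, hidden))
        if word < self.shortlist:
            # Frequent word: only the small head matrix is used.
            return head_lp[word]
        for i, (proj, out) in enumerate(self.tails):
            lo, hi = self.cutoffs[i], self.cutoffs[i + 1]
            if lo <= word < hi:
                tail_lp = log_softmax(matvec(out, matvec(proj, hidden)))
                # log p(word) = log p(cluster) + log p(word | cluster)
                return head_lp[self.shortlist + i] + tail_lp[word - lo]
        raise ValueError("word id out of range")


model = TinyAdaptiveSoftmax(hidden_dim=16, vocab_size=50, cutoffs=[10, 30])
hidden = [0.1 * i for i in range(16)]
total = sum(math.exp(model.log_prob(hidden, w)) for w in range(50))
print(f"probabilities sum to {total:.6f}")
```

In this sketch a frequent word costs one product with the 12-row head matrix, while a rare word adds one low-dimensional projection and one product over its own cluster only, so the full 50-word output layer is never evaluated for a single query.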

Papers Using This Method

RLBenchNet: The Right Network for the Right Reinforcement Learning Task (2025-05-21)
VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits (2025-05-15)
Convergence Rates for Softmax Gating Mixture of Experts (2025-03-05)
A Combined Encoder and Transformer Approach for Coherent and High-Quality Text Generation (2024-11-19)
Large Body Language Models (2024-10-21)
DenoMamba: A fused state-space model for low-dose CT denoising (2024-09-19)
Online Residual Learning from Offline Experts for Pedestrian Tracking (2024-09-06)
Transformers for Supervised Online Continual Learning (2024-03-03)
UniMem: Towards a Unified View of Long-Context Large Language Models (2024-02-05)
Memory-efficient Stochastic methods for Memory-based Transformers (2023-11-14)
TRAMS: Training-free Memory Selection for Long-range Language Modeling (2023-10-24)
Approximating Two-Layer Feedforward Networks for Efficient Transformers (2023-10-16)
Memory Gym: Towards Endless Tasks to Benchmark Memory Capabilities of Agents (2023-09-29)
Random-Access Infinite Context Length for Transformers (2023-09-21)
RCMHA: Relative Convolutional Multi-Head Attention for Natural Language Modelling (2023-08-07)
Landmark Attention: Random-Access Infinite Context Length for Transformers (2023-05-25)
Transformer-based World Models Are Happy With 100k Interactions (2023-03-13)
GTR-CTRL: Instrument and Genre Conditioning for Guitar-Focused Music Generation with Transformers (2023-02-10)
An Comparative Analysis of Different Pitch and Metrical Grid Encoding Methods in the Task of Sequential Music Generation (2023-01-31)
Efficient Sparsely Activated Transformers (2022-08-31)