Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen
We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language. We propose a simple and effective method, Mixture of Softmaxes (MoS), to address this issue, and improve the state-of-the-art perplexities on Penn Treebank and WikiText-2 to 47.69 and 40.68, respectively. The proposed method also excels on the large-scale 1B Word dataset, outperforming the baseline by over 5.6 points in perplexity.
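The core idea behind MoS is to replace the single output softmax, whose log-probability matrix has rank bounded by the embedding dimension, with a weighted mixture of K softmaxes that share the same word embedding space. Below is a minimal PyTorch-style sketch of such an output layer; the class name, the linear prior and latent projections, and the hyperparameters in the usage snippet (e.g. `n_experts=15`, the layer sizes) are illustrative assumptions, not the exact configuration used in AWD-LSTM-MoS.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """Minimal sketch of a Mixture-of-Softmaxes (MoS) output layer.

    A single softmax over h^T W yields a log-probability matrix whose rank
    is at most the embedding size; mixing K softmaxes lifts this restriction.
    Names and default hyperparameters here are illustrative.
    """

    def __init__(self, hidden_size, embed_size, vocab_size, n_experts=15):
        super().__init__()
        self.n_experts = n_experts
        self.embed_size = embed_size
        # mixture weights (prior) over the K softmax components
        self.prior = nn.Linear(hidden_size, n_experts)
        # K context vectors, each of embedding size
        self.latent = nn.Linear(hidden_size, n_experts * embed_size)
        # shared decoder over the vocabulary (tied with input embeddings in the paper)
        self.decoder = nn.Linear(embed_size, vocab_size)

    def forward(self, hidden):
        # hidden: (batch, hidden_size)
        batch = hidden.size(0)
        # pi_k: (batch, K)
        prior = F.softmax(self.prior(hidden), dim=-1)
        # h_k: (batch * K, embed_size)
        latent = torch.tanh(self.latent(hidden)).view(batch * self.n_experts, self.embed_size)
        # per-expert word distributions: (batch, K, vocab)
        probs = F.softmax(self.decoder(latent), dim=-1).view(batch, self.n_experts, -1)
        # mix in probability space, then return log-probs for an NLL loss
        mixed = torch.bmm(prior.unsqueeze(1), probs).squeeze(1)  # (batch, vocab)
        return torch.log(mixed + 1e-8)
```

A quick usage example under assumed sizes:

```python
mos = MixtureOfSoftmaxes(hidden_size=960, embed_size=280, vocab_size=10000)
log_probs = mos(torch.randn(32, 960))  # -> (32, 10000)
```

Note that the mixture is taken over probabilities, not logits; mixing logits would collapse back to a single (low-rank) softmax.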
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Language Modelling | Penn Treebank (Word Level) | Test perplexity | 47.69 | AWD-LSTM-MoS + dynamic eval |
| Language Modelling | Penn Treebank (Word Level) | Validation perplexity | 48.33 | AWD-LSTM-MoS + dynamic eval |
| Language Modelling | Penn Treebank (Word Level) | Test perplexity | 54.44 | AWD-LSTM-MoS |
| Language Modelling | Penn Treebank (Word Level) | Validation perplexity | 56.54 | AWD-LSTM-MoS |
| Language Modelling | WikiText-2 | Test perplexity | 40.68 | AWD-LSTM-MoS + dynamic eval |
| Language Modelling | WikiText-2 | Validation perplexity | 42.41 | AWD-LSTM-MoS + dynamic eval |
| Language Modelling | WikiText-2 | Test perplexity | 61.45 | AWD-LSTM-MoS |
| Language Modelling | WikiText-2 | Validation perplexity | 63.88 | AWD-LSTM-MoS |