Many of the leading approaches in language modeling introduce novel, complex, and specialized architectures. We take existing state-of-the-art word-level language models based on LSTMs and QRNNs and extend them to both larger vocabularies and character-level granularity. When properly tuned, LSTMs and QRNNs achieve state-of-the-art results on character-level (Penn Treebank, enwik8) and word-level (WikiText-103) datasets, respectively. These results are obtained in only 12 hours (WikiText-103) to 2 days (enwik8) of training on a single modern GPU.
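As a rough illustration of the character-level setup, the sketch below shows a minimal LSTM language model in PyTorch. This is a simplified sketch, not the authors' tuned AWD-LSTM: the class name `CharLSTMLM`, layer sizes, and dropout value are all illustrative placeholders.

```python
import torch
import torch.nn as nn

class CharLSTMLM(nn.Module):
    """Minimal character-level LSTM language model (illustrative sketch).

    Hyperparameters are placeholders, not the tuned values from the paper.
    """
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=512,
                 num_layers=3, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers,
                            dropout=dropout, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        # x: (batch, seq_len) tensor of character indices
        emb = self.embed(x)                   # (batch, seq_len, emb_dim)
        out, hidden = self.lstm(emb, hidden)  # (batch, seq_len, hidden_dim)
        return self.decoder(out), hidden      # per-step logits over characters

# Toy usage: a batch of 4 sequences of 64 characters from a 50-symbol vocabulary.
model = CharLSTMLM(vocab_size=50)
logits, _ = model(torch.randint(0, 50, (4, 64)))
print(logits.shape)  # torch.Size([4, 64, 50])
```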
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Language Modelling | Penn Treebank (Character Level) | Bits per character (BPC) | 1.175 | 3-layer AWD-LSTM |
| Language Modelling | Penn Treebank (Character Level) | Bits per character (BPC) | 1.187 | 6-layer QRNN |
| Language Modelling | WikiText-103 | Test perplexity | 33 | 4-layer QRNN |
| Language Modelling | WikiText-103 | Validation perplexity | 32 | 4-layer QRNN |
| Language Modelling | enwik8 (Hutter Prize) | Bits per character (BPC) | 1.232 | 3-layer AWD-LSTM |
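For reference, BPC is the model's average negative log-likelihood per character in base 2, so a model trained with the usual natural-log cross-entropy converts as BPC = loss / ln 2. A minimal sketch of that conversion:

```python
import math

def bpc_from_cross_entropy(nats_per_char: float) -> float:
    """Convert average cross-entropy (nats per character) to bits per character."""
    return nats_per_char / math.log(2)

# For example, a cross-entropy of ~0.8145 nats/char corresponds to the
# 1.175 BPC reported above for character-level Penn Treebank.
print(round(bpc_from_cross_entropy(0.8145), 3))  # 1.175
```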