Language Modelling on One Billion Word

Metric: PPL (lower is better)
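
PPL is perplexity: the exponential of the average negative log-likelihood a model assigns to each token of the test set. The sketch below shows that calculation under stated assumptions: the function name `perplexity` is hypothetical, the inputs are natural-log probabilities, and the average is taken per token. This page does not state how individual entries tokenize or normalize, so scores computed this way may not be directly comparable to the table.

```python
import math


def perplexity(token_log_probs):
    """Return perplexity given per-token natural-log probabilities.

    Hypothetical helper for illustration; assumes one log-probability
    per scored token and per-token averaging.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    # Perplexity is the exponential of that average.
    return math.exp(avg_nll)


# A model that gives every token probability 1/4 has perplexity of about 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

A lower value means the model assigns higher probability to the held-out text, which is why lower is better.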

Results

#	Model	PPL	Extra Data	Paper	Date	Code
1	MDLM (AR baseline)	20.09	No	Simple and Effective Masked Diffusion Language M...	2024-06-11	Code
2	OmniNetT (Large)	21.5	No	OmniNet: Omnidirectional Representations from Tr...	2021-03-01	Code
3	OmniNetP (Large)	21.6	No	OmniNet: Omnidirectional Representations from Tr...	2021-03-01	Code
4	Transformer-XL Large	21.8	No	Transformer-XL: Attentive Language Models Beyond...	2019-01-09	Code
5	OmniNetB (Large)	22	No	OmniNet: Omnidirectional Representations from Tr...	2021-03-01	Code
6	MDLM	23	No	Simple and Effective Masked Diffusion Language M...	2024-06-11	Code
7	Adaptive Input Very Large	23.02	No	Adaptive Input Representations for Neural Langua...	2018-09-28	Code
8	Transformer-XL Base	23.5	No	Transformer-XL: Attentive Language Models Beyond...	2019-01-09	Code
9	SRU++ Large	23.5	No	When Attention Meets Fast Recurrence: Training L...	2021-02-24	Code
10	10 LSTM+CNN inputs + SNM10-SKIP (ensemble)	23.7	No	Exploring the Limits of Language Modeling	2016-02-07	Code
11	Adaptive Input Large	23.91	No	Adaptive Input Representations for Neural Langua...	2018-09-28	Code
12	Mesh Tensorflow	24	No	Mesh-TensorFlow: Deep Learning for Supercomputers	2018-11-05	Code
13	Cohere Large	25.06	No	-	-	-
14	SRU++	25.1	No	When Attention Meets Fast Recurrence: Training L...	2021-02-24	Code
15	DynamicConv	26.67	No	Pay Less Attention with Lightweight and Dynamic ...	2019-01-29	Code
16	High-Budget MoE	28	No	Outrageously Large Neural Networks: The Sparsely...	2017-01-23	Code
17	Evolved Transformer Big	28.6	No	The Evolved Transformer	2019-01-30	Code
18	LSTM-8192-1024 + CNN Input	30	No	Exploring the Limits of Language Modeling	2016-02-07	Code
19	LSTM-8192-1024	30.6	No	Exploring the Limits of Language Modeling	2016-02-07	Code
20	GCNN-14 bottleneck	31.9	No	Language Modeling with Gated Convolutional Netwo...	2016-12-23	Code
21	Low-Budget MoE	34.1	No	Outrageously Large Neural Networks: The Sparsely...	2017-01-23	Code
22	BIG G-LSTM-2	36	No	Factorization tricks for LSTM networks	2017-03-31	Code
23	GPT-2	42.16	Yes	-	-	Code
24	RNN-1024 + 9 Gram	51.3	No	One Billion Word Benchmark for Measuring Progres...	2013-12-11	Code
25	Sparse Non-Negative	52.9	No	Skip-gram Language Modeling Using Sparse Non-neg...	2014-12-03	-