Natural Language Inference on MultiNLI

Metric: Mismatched (higher is better)
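
As a minimal sketch of how a score in this column could be computed, the example below assumes the Mismatched metric is plain classification accuracy, reported as a percentage, on the MultiNLI mismatched evaluation set (genres that do not appear in the training data). The function name `accuracy_percent` and the toy labels are hypothetical and are not taken from any listed paper.

```python
def accuracy_percent(predictions, gold_labels):
    """Return the percentage of predictions that equal the gold labels."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        raise ValueError("cannot score an empty evaluation set")
    # Count exact label matches, then convert the fraction to a percentage.
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return 100.0 * correct / len(gold_labels)


# Hypothetical toy data: four premise/hypothesis pairs, three predicted correctly.
gold = ["entailment", "neutral", "contradiction", "neutral"]
pred = ["entailment", "neutral", "neutral", "neutral"]

score = accuracy_percent(pred, gold)
print(f"{score:.1f}")  # prints 75.0
```

Under this assumption, a leaderboard value of 91.7 would mean that 91.7% of the mismatched examples were labeled correctly.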

Results

#	Model	Mismatched	Extra Data	Paper	Date	Code	SOTA
1	Turing NLR v5 XXL 5.4B (fine-tuned)	92.4	No	-	-	-	-
2	T5	91.7	No	SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization	2019-11-08	Code	-
3	T5-11B	91.7	No	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer	2019-10-23	Code	SOTA
4	T5-3B	91.2	No	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer	2019-10-23	Code	-
5	DeBERTa (large)	91.1	No	DeBERTa: Decoding-enhanced BERT with Disentangled Attention	2020-06-05	Code	-
6	Adv-RoBERTa ensemble	90.7	No	StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding	2019-08-13	-	SOTA
7	RoBERTa (ensemble)	90.2	No	RoBERTa: A Robustly Optimized BERT Pretraining Approach	2019-07-26	Code	SOTA
8	T5-Large 770M	89.6	No	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer	2019-10-23	Code	-
9	ERNIE 2.0 Large	88.8	No	ERNIE 2.0: A Continual Pre-training Framework for Language Understanding	2019-07-29	Code	-
10	BERT-Large	88	No	FNet: Mixing Tokens with Fourier Transforms	2021-05-09	Code	-
11	MT-DNN-ensemble	87.4	No	Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding	2019-04-20	Code	SOTA
12	Snorkel MeTaL (ensemble)	87.2	No	Training Complex Models with Multi-Task Weak Supervision	2018-10-05	Code	SOTA
13	gMLP-large	86.5	No	Pay Attention to MLPs	2021-05-17	Code	-
14	RealFormer	86.34	No	RealFormer: Transformer Likes Residual Attention	2020-12-21	Code	-
15	T5-Base	86.2	No	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer	2019-10-23	Code	-
16	MT-DNN	86	No	Multi-Task Deep Neural Networks for Natural Language Understanding	2019-01-31	Code	-
17	BERT-LARGE	85.9	No	BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding	2018-10-11	Code	-
18	ERNIE 2.0 Base	85.5	No	ERNIE 2.0: A Continual Pre-training Framework for Language Understanding	2019-07-29	Code	-
19	ELC-BERT-base 98M (zero init)	84.5	No	Not all layers are equally as important: Every Layer Counts BERT	2023-11-03	-	-
20	Charformer-Tall	84.4	No	Charformer: Fast Character Transformers via Gradient-based Subword Tokenization	2021-06-23	Code	-
21	24hBERT	83.8	No	How to Train BERT with an Academic Budget	2021-04-15	Code	-
22	LTG-BERT-base 98M	83.4	No	Not all layers are equally as important: Every Layer Counts BERT	2023-11-03	-	-
23	TinyBERT-6 67M	83.2	No	TinyBERT: Distilling BERT for Natural Language Understanding	2019-09-23	Code	-
24	ERNIE	83.2	No	ERNIE: Enhanced Language Representation with Informative Entities	2019-05-17	Code	-
25	T5-Small	82.3	No	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer	2019-10-23	Code	-
26	GPST (unsupervised generative syntactic LM)	82	No	Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at Scale	2024-03-13	Code	-
27	TinyBERT-4 14.5M	81.8	No	TinyBERT: Distilling BERT for Natural Language Understanding	2019-09-23	Code	-
28	MFAE	81.43	No	-	-	Code	-
29	Finetuned Transformer LM	81.4	No	-	-	-	-
30	Finetuned Transformer LM	81.4	No	-	-	Code	-
31	SqueezeBERT	81.1	No	SqueezeBERT: What can computer vision teach NLP about efficient neural networks?	2020-06-19	Code	-
32	ELC-BERT-small 24M	79.9	No	Not all layers are equally as important: Every Layer Counts BERT	2023-11-03	-	-
33	LTG-BERT-small 24M	78.8	No	Not all layers are equally as important: Every Layer Counts BERT	2023-11-03	-	-
34	FNet-Large	76	No	FNet: Mixing Tokens with Fourier Transforms	2021-05-09	Code	-
35	aESIM	73.9	No	Attention Boosted Sequential Inference Model	2018-12-05	-	-
36	Stacked Bi-LSTMs (shortcut connections, max-pooling)	72.2	No	Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News	2018-11-02	Code	-
37	Multi-task BiLSTM + Attn	72.1	No	GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding	2018-04-20	Code	SOTA
38	T5-Large 738M	72	No	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	2023-04-27	Code	-
39	GenSen	71.3	No	Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning	2018-03-30	Code	SOTA
40	Bi-LSTM sentence encoder (max-pooling)	71.1	No	Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News	2018-11-02	Code	-
41	Stacked Bi-LSTMs (shortcut connections, max-pooling, attention)	70.5	No	Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News	2018-11-02	Code	-
42	LaMini-GPT 1.5B	69.3	No	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	2023-04-27	Code	-
43	SWEM-max	67.7	No	Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms	2018-05-24	Code	-
44	LaMini-F-T5 783M	61	No	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	2023-04-27	Code	-
45	LaMini-T5 738M	55.8	No	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	2023-04-27	Code	-
46	GPT-2-XL 1.5B	37	No	LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions	2023-04-27	Code	-