Visual Question Answering (VQA) on VQA v2 test-std

Metric: overall (higher is better)

Results

#	Model	overall	Extra Data	SOTA	Paper	Date	Code
1	BEiT-3	84.03	No	Yes	Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks	2022-08-22	Code
2	mPLUG-Huge	83.62	No	Yes	mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections	2022-05-24	Code
3	ONE-PEACE	82.52	No	No	ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities	2023-05-18	Code
4	OFA	81.98	No	Yes	OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework	2022-02-07	Code
5	X2-VLM (large)	81.8	No	No	X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	2022-11-22	Code
6	VLMo	81.3	No	Yes	VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts	2021-11-03	Code
7	Florence	80.36	No	No	Florence: A New Foundation Model for Computer Vision	2021-11-22	Code
8	SimVLM	80.34	No	Yes	SimVLM: Simple Visual Language Model Pretraining with Weak Supervision	2021-08-24	Code
9	X2-VLM (base)	80.2	No	No	X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	2022-11-22	Code
10	VAST	80.19	Yes	No	-	-	-
11	VALOR	78.62	Yes	No	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	2023-04-17	Code
12	Prompt Tuning	78.53	No	No	Prompt Tuning for Generative Multimodal Pretrained Models	2022-08-04	Code
13	Prismer	78.49	No	No	Prismer: A Vision-Language Model with Multi-Task Experts	2023-03-04	Code
14	MSR + MS Cog. Svcs., X10 models	77.45	No	Yes	VinVL: Revisiting Visual Representations in Vision-Language Models	2021-01-02	Code
15	MSR + MS Cog. Svcs.	76.63	No	No	VinVL: Revisiting Visual Representations in Vision-Language Models	2021-01-02	Code
16	ALBEF (14M)	76.04	No	No	Align before Fuse: Vision and Language Representation Learning with Momentum Distillation	2021-07-16	Code
17	BGN, ensemble	75.92	No	Yes	Bilinear Graph Networks for Visual Question Answering	2019-07-23	-
18	ERNIE-ViL-single model	74.93	No	No	ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph	2020-06-30	-
19	Single, w/o VLP	74.16	No	No	In Defense of Grid Features for Visual Question Answering	2020-01-10	Code
20	Single, w/o VLP	73.86	No	No	Deep Multimodal Neural Architecture Search	2020-04-25	Code
21	UNITER (Large)	73.4	No	No	UNITER: UNiversal Image-TExt Representation Learning	2019-09-25	Code
22	X-101 grid features + MCAN	72.71	No	No	In Defense of Grid Features for Visual Question Answering	2020-01-10	Code
23	LXMERT	72.5	No	No	LXMERT: Learning Cross-Modality Encoder Representations from Transformers	2019-08-20	Code
24	VL-BERTLARGE	72.2	No	No	VL-BERT: Pre-training of Generic Visual-Linguistic Representations	2019-08-22	Code
25	MCAN+VC	71.49	No	No	Visual Commonsense R-CNN	2020-02-27	Code
26	VisualBERT	71	No	No	VisualBERT: A Simple and Performant Baseline for Vision and Language	2019-08-09	Code
27	MCANed-6	70.9	No	Yes	Deep Modular Co-Attention Networks for Visual Question Answering	2019-06-25	Code
28	Unified VLP	70.7	No	No	Unified Vision-Language Pre-Training for Image Captioning and VQA	2019-09-24	Code
29	BAN+Glove+Counter	70.4	No	Yes	Bilinear Attention Networks	2018-05-21	Code
30	Up-Down	70.34	No	Yes	Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering	2017-07-25	Code
31	Image features from bottom-up attention (adaptive K, ensemble)	70.3	No	No	Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge	2017-08-09	Code
32	Caption VQA	69.7	No	No	Generating Question Relevant Captions to Aid Visual Question Answering	2019-06-03	-
33	MuRel	68.4	No	No	MUREL: Multimodal Relational Reasoning for Visual Question Answering	2019-02-25	Code
34	DMN	68.4	No	No	Learning to Count Objects in Natural Images for Visual Question Answering	2018-02-15	Code
35	BLOCK	67.9	No	No	BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection	2019-01-31	Code
36	MUTAN	67.4	No	Yes	MUTAN: Multimodal Tucker Fusion for Visual Question Answering	2017-05-18	Code
37	2D continuous softmax	66.27	No	No	Sparse and Continuous Attention Mechanisms	2020-06-12	Code
38	MCB [11, 12]	62.27	No	Yes	Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering	2016-12-02	Code
39	Language-only	44.26	No	No	Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering	2016-12-02	Code
40	Prior	25.98	No	No	Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering	2016-12-02	Code