Visual Question Answering (VQA) on GQA test-dev
Metric: Accuracy (higher is better)
Results
| # | Model | Accuracy | Extra Data | Paper | Date | Code | SOTA tag |
|---|-------|----------|------------|-------|------|------|----------|
| 1 | CFR | 72.1 | No | Coarse-to-Fine Reasoning for Visual Question Answering | 2021-10-06 | Yes | SOTA |
| 2 | PaLI-X-VPD | 67.3 | No | Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models | 2023-12-05 | No | |
| 3 | CuMo-7B | 64.9 | Yes | CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | 2024-05-09 | Yes | |
| 4 | Video-LaVIT | 64.4 | No | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | 2024-02-05 | Yes | |
| 5 | NSM | 62.95 | No | Learning by Abstraction: The Neural State Machine | 2019-07-09 | Yes | SOTA |
| 6 | Lyrics | 62.4 | No | Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | 2023-12-08 | No | |
| 7 | LXMERT (Pre-train + scratch) | 60 | No | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | 2019-08-20 | Yes | |
| 8 | single-hop + LCGN (ours) | 55.8 | No | Language-Conditioned Graph Networks for Relational Reasoning | 2019-05-10 | Yes | SOTA |
| 9 | HYDRA | 47.9 | No | HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | 2024-03-19 | Yes | |
| 10 | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 44.7 | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Yes | |
| 11 | BLIP-2 ViT-L FlanT5 XL (zero-shot) | 44.4 | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Yes | |
| 12 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 44.2 | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Yes | |
| 13 | PNP-VQA | 41.9 | No | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | 2022-10-17 | Yes | |
| 14 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 36.4 | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Yes | |
| 15 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 34.6 | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Yes | |
| 16 | BLIP-2 ViT-L OPT 2.7B (zero-shot) | 33.9 | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Yes | |
| 17 | FewVLM (zero-shot) | 29.3 | No | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | 2021-10-16 | Yes | |