Visual Question Answering (VQA) on OK-VQA
Metric: Accuracy (higher is better)
Results
| # | Model | Accuracy | SOTA | Extra Data | Paper | Date | Code link |
|---|-------|----------|------|------------|-------|------|-----------|
| 1 | PaLI-X-VPD | 66.8 | Yes | No | Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models | 2023-12-05 | - |
| 2 | PaLM-E-562B | 66.1 | Yes | No | PaLM-E: An Embodied Multimodal Language Model | 2023-03-06 | Yes |
| 3 | PaLI-X (Single-task FT) | 66.1 |  | No | PaLI-X: On Scaling up a Multilingual Vision and Language Model | 2023-05-29 | Yes |
| 4 | PaLI 17B | 64.5 | Yes | No | PaLI: A Jointly-Scaled Multilingual Language-Image Model | 2022-09-14 | Yes |
| 5 | Prophet | 62.5 |  | No | Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering | 2023-03-03 | Yes |
| 6 | RA-VQA-v2 (BLIP 2) | 62.08 |  | No | Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | 2023-09-29 | Yes |
| 7 | A Simple Baseline for KB-VQA | 61.2 |  | No | A Simple Baseline for Knowledge-Based Visual Question Answering | 2023-10-20 | - |
| 8 | PromptCap | 60.4 |  | No | PromptCap: Prompt-Guided Task-Aware Image Captioning | 2022-11-15 | Yes |
| 9 | ReVeaL WIT + CC12M + Wikidata + VQA-2 | 59.1 |  | No | REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory | 2022-12-10 | Yes |
| 10 | Lyrics | 58.2 |  | No | Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | 2023-12-08 | - |
| 11 | REVIVE (Ensemble) | 58 | Yes | No | REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | 2022-06-02 | Yes |
| 12 | REVIVE (Single) | 56.6 |  | No | REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | 2022-06-02 | Yes |
| 13 | RA-VQA-v2 (T5-large) | 54.85 |  | No | Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | 2023-09-29 | Yes |
| 14 | RA-VQA (T5-large) | 54.48 |  | No | Retrieval Augmented Visual Question Answering with Outside Knowledge | 2022-10-07 | Yes |
| 15 | VK-OOD | 52.4 |  | No | - | - | Yes |
| 16 | VK-OOD | 52.4 |  | No | - | - | Yes |
| 17 | RA-VQA-FrDPR (T5-large) | 51.22 |  | No | Retrieval Augmented Visual Question Answering with Outside Knowledge | 2022-10-07 | Yes |
| 18 | Flamingo80B | 50.6 | Yes | No | Flamingo: a Visual Language Model for Few-Shot Learning | 2022-04-29 | Yes |
| 19 | TRiG (T5-Large) | 50.5 |  | No | - | - | - |
| 20 | HYDRA | 48.6 |  | No | HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | 2024-03-19 | Yes |
| 21 | PICa | 48 | Yes | Yes | An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | 2021-09-10 | Yes |
| 22 | LaKo | 47.01 |  | No | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | 2022-07-26 | Yes |
| 23 | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 45.9 |  | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Yes |
| 24 | Flamingo9B | 44.7 |  | No | Flamingo: a Visual Language Model for Few-Shot Learning | 2022-04-29 | Yes |
| 25 | VLC-BERT | 43.1 |  | No | VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | 2022-10-24 | Yes |
| 26 | T5(Tan and Bansal, 2019) + Prefixes | 42.03 |  | No | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | 2022-07-26 | Yes |
| 27 | Flamingo3B | 41.2 |  | No | Flamingo: a Visual Language Model for Few-Shot Learning | 2022-04-29 | Yes |
| 28 | BLIP-2 ViT-G FlanT5 XL (zero-shot) | 40.7 |  | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Yes |
| 29 | BLIP-2 ViT-L FlanT5 XL (zero-shot) | 39.4 |  | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Yes |
| 30 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | 36.4 |  | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Yes |
| 31 | PNP-VQA | 35.9 |  | No | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | 2022-10-17 | Yes |
| 32 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | 31.7 |  | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Yes |
| 33 | BLIP-2 ViT-L OPT 2.7B (zero-shot) | 30.2 |  | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 | Yes |
| 34 | FewVLM | 16.5 |  | No | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | 2021-10-16 | Yes |
| 35 | MetaLM | 11.4 |  | No | Language Models are General-Purpose Interfaces | 2022-06-13 | Yes |
| 36 | VLKD(ViT-B/16) | 10.5 |  | No | - | - | - |
| 37 | Frozen | 5.9 | Yes | No | Multimodal Few-Shot Learning with Frozen Language Models | 2021-06-25 | - |
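
The page does not say how the SOTA marker is assigned. One reading that appears consistent with the listing above is that an entry is marked when its accuracy strictly exceeds every result published on or before its date. The sketch below tests that reading on a subset of the rows. The rule itself, the `ROWS` subset, and the function name `sota_progression` are assumptions made for illustration, not something the leaderboard documents.

```python
from datetime import date

# (model, accuracy, publication date) copied from the leaderboard above.
# A subset is enough to illustrate the idea; undated entries are left out.
ROWS = [
    ("Frozen", 5.9, date(2021, 6, 25)),
    ("PICa", 48.0, date(2021, 9, 10)),
    ("FewVLM", 16.5, date(2021, 10, 16)),
    ("Flamingo80B", 50.6, date(2022, 4, 29)),
    ("Flamingo9B", 44.7, date(2022, 4, 29)),
    ("REVIVE (Ensemble)", 58.0, date(2022, 6, 2)),
    ("REVIVE (Single)", 56.6, date(2022, 6, 2)),
    ("PaLI 17B", 64.5, date(2022, 9, 14)),
    ("Prophet", 62.5, date(2023, 3, 3)),
    ("PaLM-E-562B", 66.1, date(2023, 3, 6)),
    ("PaLI-X (Single-task FT)", 66.1, date(2023, 5, 29)),
    ("PaLI-X-VPD", 66.8, date(2023, 12, 5)),
    ("HYDRA", 48.6, date(2024, 3, 19)),
]


def sota_progression(rows):
    """Return models that strictly beat all earlier results (assumed rule)."""
    best = float("-inf")
    flagged = []
    # Sort by date, then by accuracy descending, so that the best entry
    # published on a given day is considered before its same-day siblings.
    for model, accuracy, _ in sorted(rows, key=lambda r: (r[2], -r[1])):
        if accuracy > best:  # strict: a tie does not earn the marker
            best = accuracy
            flagged.append(model)
    return flagged


if __name__ == "__main__":
    for name in sota_progression(ROWS):
        print(name)
```

On this subset the function returns Frozen, PICa, Flamingo80B, REVIVE (Ensemble), PaLI 17B, PaLM-E-562B, and PaLI-X-VPD, which matches the entries carrying the SOTA marker. PaLI-X (Single-task FT) ties PaLM-E-562B at 66.1 but was published later, so the strict comparison leaves it unmarked, as in the listing.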