Visual Question Answering (VQA) on InfiMM-Eval
Metric: Deductive (higher is better)
Results
| # | Model | Deductive | Extra Data | Paper | Date |
|---|-------|-----------|------------|-------|------|
| 1 | GPT-4V | 74.86 | No | GPT-4 Technical Report | 2023-03-15 |
| 2 | SPHINX v2 | 42.17 | No | SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | 2023-11-13 |
| 3 | Qwen-VL-Chat | 37.55 | No | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | 2023-08-24 |
| 4 | CogVLM-Chat | 36.75 | No | CogVLM: Visual Expert for Pretrained Language Models | 2023-11-06 |
| 5 | LLaVA-1.5 | 30.94 | No | Improved Baselines with Visual Instruction Tuning | 2023-10-05 |
| 6 | Emu | 28.9 | No | Emu: Generative Pretraining in Multimodality | 2023-07-11 |
| 7 | LLaMA-Adapter V2 | 28.7 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | 2023-04-28 |
| 8 | InstructBLIP | 27.56 | No | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | 2023-05-11 |
| 9 | InternLM-XComposer-VL | 26.77 | No | InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | 2023-09-26 |
| 10 | mPLUG-Owl2 | 23.43 | No | mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | 2023-11-07 |
| 11 | Otter | 22.49 | No | Otter: A Multi-Modal Model with In-Context Instruction Tuning | 2023-05-05 |
| 12 | MiniGPT-v2 | 11.02 | No | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | 2023-04-20 |
| 13 | OpenFlamingo-v2 | 8.88 | No | OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | 2023-08-02 |
| 14 | BLIP-2-OPT2.7B | 2.76 | No | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | 2023-01-30 |

The page marks two entries as SOTA: GPT-4V (rank 1) and BLIP-2-OPT2.7B (rank 14).