Visual Question Answering (VQA) on InfiMM-Eval

Metric: Overall score (higher is better)

Results

#	Model	Overall score	Extra Data	Paper	Date	Code
1	GPT-4V	74.44	No	GPT-4 Technical Report	2023-03-15	Code
2	SPHINX v2	39.48	No	SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models	2023-11-13	Code
3	Qwen-VL-Chat	37.39	No	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	2023-08-24	Code
4	CogVLM-Chat	37.16	No	CogVLM: Visual Expert for Pretrained Language Models	2023-11-06	Code
5	LLaVA-1.5	32.62	No	Improved Baselines with Visual Instruction Tuning	2023-10-05	Code
6	LLaMA-Adapter V2	30.46	No	LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	2023-04-28	Code
7	Emu	28.24	No	Emu: Generative Pretraining in Multimodality	2023-07-11	Code
8	InstructBLIP	28.02	No	InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	2023-05-11	Code
9	InternLM-XComposer-VL	26.84	No	InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition	2023-09-26	Code
10	Otter	22.69	No	Otter: A Multi-Modal Model with In-Context Instruction Tuning	2023-05-05	Code
11	mPLUG-Owl2	20.05	No	mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration	2023-11-07	Code
12	BLIP-2-OPT2.7B	19.31	No	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	2023-01-30	Code
13	MiniGPT-v2	10.43	No	MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	2023-04-20	Code
14	OpenFlamingo-v2	6.82	No	OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models	2023-08-02	Code

Entries flagged SOTA on the page: GPT-4V (#1) and BLIP-2-OPT2.7B (#12).
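
The ranking in the results above follows directly from the metric: entries are ordered by overall score, highest first. A minimal sketch of that ordering in Python is given here. The `SCORES` dictionary and the `rank_by_score` helper are illustrative names, not part of the leaderboard or of any InfiMM-Eval tooling; the numbers are copied from the results listed above.

```python
# Overall scores copied from the results above (model name -> overall score).
# SCORES and rank_by_score are hypothetical names used only for this sketch.
SCORES = {
    "Otter": 22.69,
    "GPT-4V": 74.44,
    "Emu": 28.24,
    "SPHINX v2": 39.48,
    "MiniGPT-v2": 10.43,
    "Qwen-VL-Chat": 37.39,
    "CogVLM-Chat": 37.16,
    "LLaVA-1.5": 32.62,
    "LLaMA-Adapter V2": 30.46,
    "InstructBLIP": 28.02,
    "InternLM-XComposer-VL": 26.84,
    "mPLUG-Owl2": 20.05,
    "BLIP-2-OPT2.7B": 19.31,
    "OpenFlamingo-v2": 6.82,
}


def rank_by_score(scores):
    """Return (rank, model, score) tuples ordered by score, highest first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, model, score)
        for rank, (model, score) in enumerate(ordered, start=1)
    ]


ranking = rank_by_score(SCORES)

# Print the ranking as a simple aligned table.
for rank, model, score in ranking:
    print(f"{rank:>2}  {model:<22} {score:6.2f}")
```

Because the metric is "higher is better", the sort uses `reverse=True`; a lower-is-better metric would drop that argument.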