Visual Question Answering (VQA) on ViP-Bench

Metric: GPT-4 score (human) (higher is better)

Results

#	Model	GPT-4 score (human)	Extra Data	Paper	Date	Code
1	GPT-4V-turbo-detail:high (Visual Prompt)	59.9	No	GPT-4 Technical Report	2023-03-15	Code
2	GPT-4V-turbo-detail:low (Visual Prompt)	51.4	No	GPT-4 Technical Report	2023-03-15	Code
3	LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt)	49	Yes	Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning	2024-12-04	Code
4	ViP-LLaVA-13B (Visual Prompt)	48.2	No	Making Large Language Models Better Data Creators	2023-10-31	Code
5	LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt)	48.2	Yes	Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning	2024-12-04	Code
6	LLaVA-1.5-13B (Visual Prompt)	42.9	No	Improved Baselines with Visual Instruction Tuning	2023-10-05	Code
7	Qwen-VL-Chat (Visual Prompt)	41.7	No	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	2023-08-24	Code
8	InstructBLIP-13B (Visual Prompt)	35.2	No	InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	2023-05-11	Code
