Visual Question Answering on ViP-Bench
Metric: GPT-4 score (bbox) (higher is better)
Results
| # | Model | GPT-4 score (bbox) | Extra Data | Paper | Date |
|---|-------|--------------------|------------|-------|------|
| 1 | GPT-4V-turbo-detail:high (Visual Prompt) | 60.7 | No | GPT-4 Technical Report | 2023-03-15 |
| 2 | GPT-4V-turbo-detail:low (Visual Prompt) | 52.8 | No | GPT-4 Technical Report | 2023-03-15 |
| 3 | LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt) | 50.5 | Yes | Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024-12-04 |
| 4 | ViP-LLaVA-13B (Visual Prompt) | 48.3 | No | Making Large Language Models Better Data Creators | 2023-10-31 |
| 5 | LLaVA-1.5-13B (Coordinates) | 47.1 | No | Improved Baselines with Visual Instruction Tuning | 2023-10-05 |
| 6 | Qwen-VL-Chat (Coordinates) | 45.3 | No | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | 2023-08-24 |
| 7 | LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt) | 45.1 | Yes | Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024-12-04 |
| 8 | LLaVA-1.5-13B (Visual Prompt) | 41.8 | No | Improved Baselines with Visual Instruction Tuning | 2023-10-05 |
| 9 | Qwen-VL-Chat (Visual Prompt) | 39.2 | No | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | 2023-08-24 |
| 10 | InstructBLIP-13B (Visual Prompt) | 35.8 | No | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | 2023-05-11 |
| 11 | GPT4ROI 7B (ROI) | 35.1 | No | GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | 2023-07-07 |
| 12 | Shikra-7B (Coordinates) | 33.7 | No | Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | 2023-06-27 |
| 13 | Kosmos-2 (Discrete Token) | 26.9 | No | Kosmos-2: Grounding Multimodal Large Language Models to the World | 2023-06-26 |
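Since the metric is "higher is better", the results above can be ranked and compared with a few lines of Python. The following is only a minimal sketch: the scores are copied from the listing above, while the tuple layout and the function names `rank` and `coordinates_minus_visual_prompt` are assumptions made for illustration, not part of the leaderboard or of any official tooling.

```python
# Rows copied from the results above: (model, prompt format, GPT-4 score (bbox)).
# Splitting the model name from its prompt format is an illustrative choice.
RESULTS = [
    ("GPT-4V-turbo-detail:high", "Visual Prompt", 60.7),
    ("GPT-4V-turbo-detail:low", "Visual Prompt", 52.8),
    ("LLaVA-NeXT-Inst-IT-Qwen2-7B", "Visual Prompt", 50.5),
    ("ViP-LLaVA-13B", "Visual Prompt", 48.3),
    ("LLaVA-1.5-13B", "Coordinates", 47.1),
    ("Qwen-VL-Chat", "Coordinates", 45.3),
    ("LLaVA-NeXT-Inst-IT-Vicuna-7B", "Visual Prompt", 45.1),
    ("LLaVA-1.5-13B", "Visual Prompt", 41.8),
    ("Qwen-VL-Chat", "Visual Prompt", 39.2),
    ("InstructBLIP-13B", "Visual Prompt", 35.8),
    ("GPT4ROI 7B", "ROI", 35.1),
    ("Shikra-7B", "Coordinates", 33.7),
    ("Kosmos-2", "Discrete Token", 26.9),
]


def rank(results):
    """Return rows sorted best first (the metric is 'higher is better')."""
    return sorted(results, key=lambda row: row[2], reverse=True)


def coordinates_minus_visual_prompt(results):
    """For models listed with both prompt formats, return the score difference
    (Coordinates minus Visual Prompt), rounded to one decimal place."""
    by_model = {}
    for model, prompt_format, score in results:
        by_model.setdefault(model, {})[prompt_format] = score
    return {
        model: round(scores["Coordinates"] - scores["Visual Prompt"], 1)
        for model, scores in by_model.items()
        if "Coordinates" in scores and "Visual Prompt" in scores
    }


if __name__ == "__main__":
    for position, (model, prompt_format, score) in enumerate(rank(RESULTS), start=1):
        print(f"{position:>2}  {score:5.1f}  {model} ({prompt_format})")
    print(coordinates_minus_visual_prompt(RESULTS))
```

Only LLaVA-1.5-13B and Qwen-VL-Chat appear with both a Coordinates and a Visual Prompt entry, so the comparison covers just those two models.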