Visual Question Answering on ViP-Bench

Metric: GPT-4 score (bbox) (higher is better)

Results

#	Model	GPT-4 score (bbox)	Extra Data	Paper	Date	Code
1	GPT-4V-turbo-detail:high (Visual Prompt)	60.7	No	GPT-4 Technical Report	2023-03-15	Code
2	GPT-4V-turbo-detail:low (Visual Prompt)	52.8	No	GPT-4 Technical Report	2023-03-15	Code
3	LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt)	50.5	Yes	Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning	2024-12-04	Code
4	ViP-LLaVA-13B (Visual Prompt)	48.3	No	Making Large Language Models Better Data Creators	2023-10-31	Code
5	LLaVA-1.5-13B (Coordinates)	47.1	No	Improved Baselines with Visual Instruction Tuning	2023-10-05	Code
6	Qwen-VL-Chat (Coordinates)	45.3	No	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	2023-08-24	Code
7	LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt)	45.1	Yes	Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning	2024-12-04	Code
8	LLaVA-1.5-13B (Visual Prompt)	41.8	No	Improved Baselines with Visual Instruction Tuning	2023-10-05	Code
9	Qwen-VL-Chat (Visual Prompt)	39.2	No	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	2023-08-24	Code
10	InstructBLIP-13B (Visual Prompt)	35.8	No	InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	2023-05-11	Code
11	GPT4RoI-7B (ROI)	35.1	No	GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest	2023-07-07	Code
12	Shikra-7B (Coordinates)	33.7	No	Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic	2023-06-27	Code
13	Kosmos-2 (Discrete Token)	26.9	No	Kosmos-2: Grounding Multimodal Large Language Models to the World	2023-06-26	Code
