Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Visual Question Answering (VQA) on ViP-Bench

Metric: GPT-4 score (bbox) (higher is better)


Results

| # | Model | GPT-4 score (bbox) | Extra Data | Paper | Date | Code |
|---|-------|-------------------:|------------|-------|------|------|
| 1 | GPT-4V-turbo-detail:high (Visual Prompt) | 60.7 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 2 | GPT-4V-turbo-detail:low (Visual Prompt) | 52.8 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 3 | LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt) | 50.5 | Yes | Inst-IT: Boosting Multimodal Instance Understand... | 2024-12-04 | Code |
| 4 | ViP-LLaVA-13B (Visual Prompt) | 48.3 | No | Making Large Language Models Better Data Creators | 2023-10-31 | Code |
| 5 | LLaVA-1.5-13B (Coordinates) | 47.1 | No | Improved Baselines with Visual Instruction Tuning | 2023-10-05 | Code |
| 6 | Qwen-VL-Chat (Coordinates) | 45.3 | No | Qwen-VL: A Versatile Vision-Language Model for U... | 2023-08-24 | Code |
| 7 | LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt) | 45.1 | Yes | Inst-IT: Boosting Multimodal Instance Understand... | 2024-12-04 | Code |
| 8 | LLaVA-1.5-13B (Visual Prompt) | 41.8 | No | Improved Baselines with Visual Instruction Tuning | 2023-10-05 | Code |
| 9 | Qwen-VL-Chat (Visual Prompt) | 39.2 | No | Qwen-VL: A Versatile Vision-Language Model for U... | 2023-08-24 | Code |
| 10 | InstructBLIP-13B (Visual Prompt) | 35.8 | No | InstructBLIP: Towards General-purpose Vision-Lan... | 2023-05-11 | Code |
| 11 | GPT4ROI 7B (ROI) | 35.1 | No | GPT4RoI: Instruction Tuning Large Language Model... | 2023-07-07 | Code |
| 12 | Shikra-7B (Coordinates) | 33.7 | No | Shikra: Unleashing Multimodal LLM's Referential ... | 2023-06-27 | Code |
| 13 | Kosmos-2 (Discrete Token) | 26.9 | No | Kosmos-2: Grounding Multimodal Large Language Mo... | 2023-06-26 | Code |
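The ranking above can be reproduced programmatically. A minimal sketch, using only the model names and scores copied from the table (no other data assumed): sort descending on the GPT-4 score (bbox) metric, since higher is better, and number the rows.

```python
# Leaderboard rows as (model, score) pairs, copied from the table above.
results = [
    ("GPT-4V-turbo-detail:high (Visual Prompt)", 60.7),
    ("GPT-4V-turbo-detail:low (Visual Prompt)", 52.8),
    ("LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt)", 50.5),
    ("ViP-LLaVA-13B (Visual Prompt)", 48.3),
    ("LLaVA-1.5-13B (Coordinates)", 47.1),
    ("Qwen-VL-Chat (Coordinates)", 45.3),
    ("LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt)", 45.1),
    ("LLaVA-1.5-13B (Visual Prompt)", 41.8),
    ("Qwen-VL-Chat (Visual Prompt)", 39.2),
    ("InstructBLIP-13B (Visual Prompt)", 35.8),
    ("GPT4ROI 7B (ROI)", 35.1),
    ("Shikra-7B (Coordinates)", 33.7),
    ("Kosmos-2 (Discrete Token)", 26.9),
]

# Higher is better, so sort by score descending; enumerate supplies the rank column.
ranked = sorted(results, key=lambda row: row[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    print(f"{rank:2d}  {score:5.1f}  {model}")
```

Because the table is already in descending score order, `ranked` matches the listed order exactly.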