Metric: GPT-4 score (bbox) (higher is better)
| # | Model | GPT-4 score (bbox) | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | GPT-4V-turbo-detail:high (Visual Prompt) | 60.7 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 2 | GPT-4V-turbo-detail:low (Visual Prompt) | 52.8 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 3 | LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt) | 50.5 | Yes | Inst-IT: Boosting Multimodal Instance Understand... | 2024-12-04 | Code |
| 4 | ViP-LLaVA-13B (Visual Prompt) | 48.3 | No | Making Large Language Models Better Data Creators | 2023-10-31 | Code |
| 5 | LLaVA-1.5-13B (Coordinates) | 47.1 | No | Improved Baselines with Visual Instruction Tuning | 2023-10-05 | Code |
| 6 | Qwen-VL-Chat (Coordinates) | 45.3 | No | Qwen-VL: A Versatile Vision-Language Model for U... | 2023-08-24 | Code |
| 7 | LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt) | 45.1 | Yes | Inst-IT: Boosting Multimodal Instance Understand... | 2024-12-04 | Code |
| 8 | LLaVA-1.5-13B (Visual Prompt) | 41.8 | No | Improved Baselines with Visual Instruction Tuning | 2023-10-05 | Code |
| 9 | Qwen-VL-Chat (Visual Prompt) | 39.2 | No | Qwen-VL: A Versatile Vision-Language Model for U... | 2023-08-24 | Code |
| 10 | InstructBLIP-13B (Visual Prompt) | 35.8 | No | InstructBLIP: Towards General-purpose Vision-Lan... | 2023-05-11 | Code |
| 11 | GPT4ROI 7B (ROI) | 35.1 | No | GPT4RoI: Instruction Tuning Large Language Model... | 2023-07-07 | Code |
| 12 | Shikra-7B (Coordinates) | 33.7 | No | Shikra: Unleashing Multimodal LLM's Referential ... | 2023-06-27 | Code |
| 13 | Kosmos-2 (Discrete Token) | 26.9 | No | Kosmos-2: Grounding Multimodal Large Language Mo... | 2023-06-26 | Code |