Metric: 1 Image, 4*4 Stitching, Exact Accuracy (higher is better)
| # | Model | 1 Image, 4*4 Stitching, Exact Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | GPT-4o | 83 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 2 | GPT-4V | 54.72 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 3 | Gemini Pro 1.5 | 39.85 | No | Gemini 1.5: Unlocking multimodal understanding a... | 2024-03-08 | Code |
| 4 | Gemini Pro 1.0 | 24.78 | No | Gemini: A Family of Highly Capable Multimodal Mo... | 2023-12-19 | Code |
| 5 | LLaVA-Llama-3 | 17.5 | No | LLaVA-UHD: an LMM Perceiving Any Aspect Ratio an... | 2024-03-18 | Code |
| 6 | Claude 3 Opus | 12.3 | No | - | - | - |
| 7 | IDEFICS2-8B | 7.8 | No | What matters when building vision-language models? | 2024-05-03 | - |
| 8 | InstructBLIP-Flan-T5-XXL | 6.2 | No | InstructBLIP: Towards General-purpose Vision-Lan... | 2023-05-11 | Code |
| 9 | CogVLM2-Llama-3 | 0.9 | No | CogVLM: Visual Expert for Pretrained Language Mo... | 2023-11-06 | Code |
| 10 | mPLUG-Owl-v2 | 0.3 | No | mPLUG-Owl2: Revolutionizing Multi-modal Large La... | 2023-11-07 | Code |
| 11 | CogVLM-17B | 0.1 | No | CogVLM: Visual Expert for Pretrained Language Mo... | 2023-11-06 | Code |
| 12 | InstructBLIP-Vicuna-13B | 0 | No | - | - | Code |