Metric: 1 Image, 8*8 Stitching, Exact Accuracy (higher is better)
| # | Model | 1 Image, 8*8 Stitching, Exact Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Gemini Pro 1.5 | 29.81 | No | Gemini 1.5: Unlocking multimodal understanding a... | 2024-03-08 | Code |
| 2 | GPT-4o | 19 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 3 | GPT-4V | 7.3 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 4 | LLaVA-Llama-3 | 3.3 | No | LLaVA-UHD: an LMM Perceiving Any Aspect Ratio an... | 2024-03-18 | Code |
| 5 | InstructBLIP-Flan-T5-XXL | 2.2 | No | InstructBLIP: Towards General-purpose Vision-Lan... | 2023-05-11 | Code |
| 6 | Gemini Pro 1.0 | 2.11 | No | Gemini: A Family of Highly Capable Multimodal Mo... | 2023-12-19 | Code |
| 7 | Claude 3 Opus | 1.6 | No | - | - | - |
| 8 | IDEFICS2-8B | 0.9 | No | What matters when building vision-language models? | 2024-05-03 | - |
| 9 | mPLUG-Owl-v2 | 0.7 | No | mPLUG-Owl2: Revolutionizing Multi-modal Large La... | 2023-11-07 | Code |
| 10 | CogVLM-17B | 0.3 | No | CogVLM: Visual Expert for Pretrained Language Mo... | 2023-11-06 | Code |
| 11 | CogVLM2-Llama-3 | 0.1 | No | CogVLM: Visual Expert for Pretrained Language Mo... | 2023-11-06 | Code |
| 12 | InstructBLIP-Vicuna-13B | 0 | No | - | - | Code |