Metric: Accuracy (%) (higher is better)
| # | Model | Accuracy (%) | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Gemini 1.5 Pro | 81.3 | No | Gemini 1.5: Unlocking multimodal understanding a... | 2024-03-08 | Code |
| 2 | Video-RAG (Based on LLaVA-Video) | 77.4 | No | Video-RAG: Visually-aligned Retrieval-Augmented ... | 2024-11-20 | Code |
| 3 | GPT-4o | 77.2 | No | GPT-4o: Visual perception performance of multimo... | 2024-06-14 | - |
| 4 | Gemini 1.5 Flash | 75 | No | Gemini 1.5: Unlocking multimodal understanding a... | 2024-03-08 | Code |
| 5 | GPT-4o mini | 68.9 | No | GPT-4o: Visual perception performance of multimo... | 2024-06-14 | - |
| 6 | BIMBA-LLaVA-Qwen2-7B | 64.67 | No | BIMBA: Selective-Scan Compression for Long-Range... | 2025-03-12 | Code |
| 7 | VILA-1.5 (34B) | 64.1 | No | VILA: On Pre-training for Visual Language Models | 2023-12-12 | Code |
| 8 | MiniCPM-V 2.6 (8B) | 63.7 | No | MiniCPM-V: A GPT-4V Level MLLM on Your Phone | 2024-08-03 | Code |
| 9 | VideoLLaMA2 (72B) | 63.1 | No | VideoLLaMA 2: Advancing Spatial-Temporal Modelin... | 2024-06-11 | Code |
| 10 | LongVU (7B) | 60.6 | No | LongVU: Spatiotemporal Adaptive Compression for ... | 2024-10-22 | Code |
| 11 | VideoChat-T (7B) | 55.8 | No | TimeSuite: Improving MLLMs for Long Video Unders... | 2024-10-25 | Code |