Metric: Accuracy (%) (higher is better)
| # | Model | Accuracy (%) | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Video-RAG (based on LLaVA-Video) | 77.4 | No | Video-RAG: Visually-aligned Retrieval-Augmented ... | 2024-11-20 | Code |
| 2 | Gemini 1.5 Pro | 71.9 | No | Gemini 1.5: Unlocking multimodal understanding a... | 2024-03-08 | Code |
| 3 | GPT-4o | 70.3 | No | GPT-4o: Visual perception performance of multimo... | 2024-06-14 | - |
| 4 | Gemini 1.5 Flash | 66.3 | No | Gemini 1.5: Unlocking multimodal understanding a... | 2024-03-08 | Code |
| 5 | LLaVA-OneVision (72B) | 64.8 | No | - | - | - |
| 6 | GPT-4o mini | 62.3 | No | GPT-4o: Visual perception performance of multimo... | 2024-06-14 | - |
| 7 | VILA-1.5 (34B) | 61.4 | No | VILA: On Pre-training for Visual Language Models | 2023-12-12 | Code |
| 8 | VideoLLaMA2 (72B) | 60.9 | No | VideoLLaMA 2: Advancing Spatial-Temporal Modelin... | 2024-06-11 | Code |
| 9 | VideoChat-T (7B) | 46.3 | No | TimeSuite: Improving MLLMs for Long Video Unders... | 2024-10-25 | Code |