Metric: Accuracy (higher is better)
| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | FrozenBiLM (with speech) | 59.7 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 2 | IG-VLM (no speech, GPT-4V) | 57.8 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 3 | MiniGPT4-video-7B | 54.21 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Vi... | 2024-04-04 | Code |
| 4 | VideoChat_HD_mistral (no speech) | 50.6 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 5 | VideoChat_mistral (no speech) | 46.4 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 6 | VideoChat2 (no speech) | 40.6 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 7 | SeViLA (no speech) | 38.2 | No | Self-Chained Image-Language Model for Video Loca... | 2023-05-11 | Code |
| 8 | InternVideo (no speech) | 35.9 | No | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 9 | FrozenBiLM (no speech) | 29.7 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |