| 1 | LinVT-Qwen2-VL (7B) | 85.5 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |
| 2 | InternVL-2.5(8B) | 85.5 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code |
| 3 | VideoLLaMA3(7B) | 84.5 | No | VideoLLaMA 3: Frontier Multimodal Foundation Mod... | 2025-01-22 | Code |
| 4 | PLM-8B | 84.1 | No | PerceptionLM: Open-Access Data and Models for De... | 2025-04-17 | Code |
| 5 | BIMBA-LLaVA-Qwen2-7B | 83.73 | No | BIMBA: Selective-Scan Compression for Long-Range... | 2025-03-12 | Code |
| 6 | PLM-3B | 83.4 | No | PerceptionLM: Open-Access Data and Models for De... | 2025-04-17 | Code |
| 7 | LLaVA-Video | 83.2 | No | Video Instruction Tuning With Synthetic Data | 2024-10-03 | - |
| 8 | NVILA(8B) | 82.2 | No | NVILA: Efficient Frontier Visual Language Models | 2024-12-05 | Code |
| 9 | Oryx-1.5(7B) | 81.8 | No | Oryx MLLM: On-Demand Spatial-Temporal Understand... | 2024-09-19 | Code |
| 10 | Qwen2-VL(7B) | 81.2 | No | Qwen2-VL: Enhancing Vision-Language Model's Perc... | 2024-09-18 | Code |
| 11 | LongVILA(7B) | 80.7 | No | LongVILA: Scaling Long-Context Visual Language M... | 2024-08-19 | Code |
| 12 | PLM-1B | 80.3 | No | PerceptionLM: Open-Access Data and Models for De... | 2025-04-17 | Code |
| 13 | LLaVA-OV(72B) | 80.2 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 14 | VideoMultiAgent (GPT-4o) | 79.6 | No | VideoMultiAgents: A Multi-Agent Framework for Vi... | 2025-04-25 | Code |
| 15 | VideoChat2_HD_mistral | 79.5 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 16 | LLaVA-OV(7B) | 79.4 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 17 | Tarsier (34B) | 79.2 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code |
| 18 | LLaVA-NeXT-Interleave(14B) | 79.1 | No | LLaVA-NeXT-Interleave: Tackling Multi-image, Vid... | 2024-07-10 | Code |
| 19 | VideoChat2_mistral | 78.6 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 20 | mPLUG-Owl3(8B) | 78.6 | No | mPLUG-Owl3: Towards Long Image-Sequence Understa... | 2024-08-09 | Code |
| 21 | LLaVA-NeXT-Interleave(7B) | 78.2 | No | LLaVA-NeXT-Interleave: Tackling Multi-image, Vid... | 2024-07-10 | Code |
| 22 | AKEYS | 78.1 | No | Agentic Keyframe Search for Video Question Answe... | 2025-03-20 | Code |
| 23 | LLaVA-NeXT-Interleave(DPO) | 77.9 | No | LLaVA-NeXT-Interleave: Tackling Multi-image, Vid... | 2024-07-10 | Code |
| 24 | Vamos | 77.3 | No | Vamos: Versatile Action Models for Video Underst... | 2023-11-22 | Code |
| 25 | ViLA (3B) | 75.6 | No | ViLA: Efficient Video-Language Alignment for Vid... | 2023-12-13 | Code |
| 26 | VideoLLaMA2.1(7B) | 75.6 | No | VideoLLaMA 2: Advancing Spatial-Temporal Modelin... | 2024-06-11 | Code |
| 27 | LLaMA-VQA (33B) | 75.5 | No | Large Language Models are Temporal and Causal Re... | 2023-10-24 | Code |
| 28 | ENTER | 75.1 | No | ENTER: Event Based Interpretable Reasoning for V... | 2025-01-24 | - |
| 29 | ViLA (3B, 4 frames) | 74.4 | No | ViLA: Efficient Video-Language Alignment for Vid... | 2023-12-13 | Code |
| 30 | CREMA | 73.9 | No | CREMA: Generalizable and Efficient Video-Languag... | 2024-02-08 | Code |
| 31 | SeViLA | 73.8 | No | Self-Chained Image-Language Model for Video Loca... | 2023-05-11 | Code |
| 32 | TS-LLaVA-34B | 73.6 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 33 | TCR | 73.5 | No | Text-Conditioned Resampler For Long Form Video U... | 2023-12-19 | - |
| 34 | VideoTree (GPT4) | 73.5 | No | VideoTree: Adaptive Tree-based Video Representat... | 2024-05-29 | Code |
| 35 | LVNet(GPT-4o) | 72.9 | No | Too Many Frames, Not All Useful: Efficient Strat... | 2024-06-13 | Code |
| 36 | LSTP | 72.1 | No | Efficient Temporal Extrapolation of Multimodal L... | 2024-02-25 | Code |
| 37 | Mirasol3B | 72 | No | Mirasol3B: A Multimodal Autoregressive model for... | 2023-11-09 | - |
| 38 | VideoAgent (GPT-4) | 71.3 | No | VideoAgent: Long-form Video Understanding with L... | 2024-03-15 | Code |
| 39 | IG-VLM(LLaVA v1.6) | 70.9 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 40 | VidCtx (7B) | 70.7 | No | VidCtx: Context-aware Video Question Answering w... | 2024-12-23 | Code |
| 41 | MoReVQA(PaLM-2) | 69.2 | No | MoReVQA: Exploring Modular Reasoning Models for ... | 2024-04-09 | - |
| 42 | VideoChat2 | 68.6 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 43 | IG-VLM (GPT-4) | 68.6 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 44 | TraveLER (GPT-4) | 68.2 | No | TraveLER: A Modular Multi-LMM Agent Framework fo... | 2024-04-01 | Code |
| 45 | LLoVi (GPT-4) | 67.7 | No | A Simple LLM Framework for Long-Range Video Ques... | 2023-12-28 | Code |
| 46 | LongVA(32 frames) | 67.1 | No | Long Context Transfer from Language to Vision | 2024-06-24 | Code |
| 47 | Q-ViD | 66.3 | No | Question-Instructed Visual Descriptions for Zero... | 2024-02-16 | Code |
| 48 | ProViQ | 64.6 | No | Zero-Shot Video Question Answering with Procedur... | 2023-12-01 | - |
| 49 | SlowFast-LLaVA-34B | 64.2 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 50 | SeViLA (4B) | 63.6 | No | Self-Chained Image-Language Model for Video Loca... | 2023-05-11 | Code |
| 51 | RTQ | 63.2 | No | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code |
| 52 | HiTeA | 63.1 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | - |
| 53 | VideoChat2 | 61.7 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 54 | DeepStack-L(7B) | 61 | No | DeepStack: Deeply Stacking Visual Tokens is Surp... | 2024-06-06 | - |
| 55 | LangRepo (12B) | 60.9 | No | Language Repository for Long Video Understanding | 2024-03-21 | Code |
| 56 | CoVGT(PT) | 60.7 | Yes | Contrastive Video Question Answering via Video G... | 2023-02-27 | Code |
| 57 | SeViT | 60.6 | No | Semi-Parametric Video-Grounded Text Generation | 2023-01-27 | - |
| 58 | ViperGPT(0-shot) | 60 | No | ViperGPT: Visual Inference via Python Execution ... | 2023-03-14 | Code |
| 59 | CoVGT | 60 | No | Contrastive Video Question Answering via Video G... | 2023-02-27 | Code |
| 60 | ViperGPT (GPT-3.5) | 60 | No | ViperGPT: Visual Inference via Python Execution ... | 2023-03-14 | Code |
| 61 | GF | 58.83 | No | Glance and Focus: Memory Prompting for Multi-Eve... | 2024-01-03 | Code |
| 62 | VFC | 58.6 | Yes | Verbs in Action: Improving verb understanding in... | 2023-04-13 | Code |
| 63 | ATM | 58.3 | No | ATM: Action Temporality Modeling for Video Quest... | 2023-09-05 | - |
| 64 | MIST | 57.2 | No | MIST: Multi-modal Iterative Spatial-Temporal Tra... | 2022-12-19 | Code |
| 65 | VGT(PT) | 56.9 | Yes | Video Graph Transformer for Video Question Answe... | 2022-07-12 | Code |
| 66 | PAXION | 56.9 | Yes | Paxion: Patching Action Knowledge in Video-Langu... | 2023-05-18 | Code |
| 67 | MVU (13B) | 55.2 | No | Understanding Long Videos with Multimodal Langua... | 2024-03-25 | Code |
| 68 | VGT | 55 | No | Video Graph Transformer for Video Question Answe... | 2022-07-12 | Code |
| 69 | ATP | 54.3 | No | Revisiting the "Video" in Video-Language Underst... | 2022-06-03 | Code |
| 70 | LLoVi (7B) | 54.3 | No | A Simple LLM Framework for Long-Range Video Ques... | 2023-12-28 | Code |
| 71 | P3D-G | 53.4 | No | (2.5+1)D Spatio-Temporal Scene Graphs for Video ... | 2022-02-18 | - |
| 72 | VFC | 51.5 | No | Verbs in Action: Improving verb understanding in... | 2023-04-13 | Code |
| 73 | HQGA | 51.4 | No | Video as Conditional Graph Hierarchy for Multi-G... | 2021-12-12 | Code |
| 74 | Mistral (7B) | 51.1 | No | Mistral 7B | 2023-10-10 | Code |