| 1 | Tarsier (34B) | 61.6 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code |
| 2 | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 61.2 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 3 | PLLaVA (34B) | 60.9 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 4 | PPLLaVA-7B | 60.7 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 5 | LinVT-Qwen2-VL(7B) | 60.1 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |
| 6 | SlowFast-LLaVA-34B | 59.2 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 7 | TS-LLaVA-34B | 58.9 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 8 | GPT-2 + CLIP-32 (Zero-Shot) | 58.4 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 9 | IG-VLM | 58.4 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 10 | VideoCoCa | 56.1 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 11 | LLaVA-Mini | 53.5 | No | LLaVA-Mini: Efficient Image and Video Large Mult... | 2025-01-07 | Code |
| 12 | Flash-VStream | 51.9 | No | Flash-VStream: Memory-Based Real-Time Understand... | 2024-06-12 | Code |
| 13 | Mirasol3B | 51.13 | No | Mirasol3B: A Multimodal Autoregressive model for... | 2023-11-09 | - |
| 14 | ST-LLM | 50.9 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 15 | VideoGPT+ | 50.6 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 16 | VAST | 50.4 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 17 | CAT-7B | 50.2 | No | CAT: Enhancing Multimodal Large Language Model t... | 2024-03-07 | Code |
| 18 | Video-LaVIT | 50.1 | No | Video-LaVIT: Unified Video-Language Pre-training... | 2024-02-05 | Code |
| 19 | COSA | 49.9 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 20 | MA-LMM | 49.8 | No | MA-LMM: Memory-Augmented Large Multimodal Model ... | 2024-04-08 | Code |
| 21 | VideoChat2 | 49.1 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 23 | VALOR | 48.6 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 24 | UMT-L (ViT-L/16) | 47.9 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 25 | LLaMA-VID-13B (2 Token) | 47.5 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 27 | LLaMA-VID-7B (2 Token) | 47.4 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 29 | Chat-UniVi-13B | 46.4 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 31 | MiniGPT4-video-7B | 46.3 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Vi... | 2024-04-04 | Code |
| 32 | BT-Adapter (zero-shot) | 46.1 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 33 | Chat-UniVi | 46.1 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 35 | MovieChat | 45.7 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 37 | Video-LLaVA | 45.3 | No | Video-LLaVA: Learning United Visual Representati... | 2023-11-16 | Code |
| 39 | TESTA (ViT-B/16) | 45 | Yes | TESTA: Temporal-Spatial Token Aggregation for Lo... | 2023-10-29 | Code |
| 40 | FrozenBiLM+ | 44.8 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 41 | VindLU | 44.7 | Yes | VindLU: A Recipe for Effective Video-and-Languag... | 2022-12-09 | Code |
| 42 | Singularity-temporal | 44.1 | Yes | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 43 | Elysium | 43.4 | No | Elysium: Exploring Object-level Perception in Vi... | 2024-03-25 | Code |
| 44 | FrozenBiLM | 43.2 | Yes | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 45 | Singularity | 43.1 | Yes | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 46 | Text + Text (no Multimodal Pretext Training) | 41.4 | No | Towards Fast Adaptation of Pretrained Contrastiv... | 2022-06-05 | Code |
| 47 | All-in-one+ | 40 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 48 | VIOLET+ | 39.7 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 49 | Just Ask (fine-tune) | 38.9 | No | Just Ask: Learning to Answer Questions from Mill... | 2020-12-01 | Code |
| 50 | LocVLM-Vid-B+ | 38.2 | No | Learning to Localize Objects Improves Spatial Re... | 2024-04-11 | Code |
| 51 | LocVLM-Vid-B | 37.4 | No | Learning to Localize Objects Improves Spatial Re... | 2024-04-11 | Code |
| 52 | Video-ChatGPT | 35.2 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 54 | LLaMA Adapter V2 | 34.2 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 55 | LLaMA Adapter | 34.2 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 56 | E-SA | 31.8 | No | ActivityNet-QA: A Dataset for Understanding Comp... | 2019-06-06 | Code |
| 57 | E-MN | 27.1 | No | ActivityNet-QA: A Dataset for Understanding Comp... | 2019-06-06 | Code |
| 58 | Video Chat | 26.5 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 60 | FrozenBiLM (0-shot) | 25.9 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 61 | E-VQA | 25.1 | No | ActivityNet-QA: A Dataset for Understanding Comp... | 2019-06-06 | Code |
| 62 | FrozenBiLM | 24.7 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 63 | Video LLaMA | 12.4 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |
| 64 | Just Ask (0-shot) | 12.2 | No | Just Ask: Learning to Answer Questions from Mill... | 2020-12-01 | Code |