Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Question Answering on MSRVTT-QA

Metric: Accuracy (higher is better)
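As a rough illustration of the metric, accuracy can be read as the percentage of questions answered correctly. The sketch below assumes simple case-insensitive exact matching against a single reference answer; this page does not state how each submission scored its answers, and individual papers may use a different protocol. The function name and sample data are hypothetical.

```python
def accuracy_percent(predictions, references):
    """Return the percentage of predictions that match their reference answer.

    Assumption: matching is exact after trimming whitespace and lowercasing.
    Submissions on the leaderboard may have been scored differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Hypothetical answers for four questions; two of them match.
preds = ["dog", "Two", "cooking", "man"]
golds = ["dog", "two", "singing", "woman"]
print(accuracy_percent(preds, golds))  # 50.0
```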

Results

| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Flash-VStream | 72.4 | No | Flash-VStream: Memory-Based Real-Time Understand... | 2024-06-12 | Code |
| 2 | PLLaVA (34B) | 68.7 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 3 | Elysium | 67.5 | No | Elysium: Exploring Object-level Perception in Vi... | 2024-03-25 | Code |
| 4 | SlowFast-LLaVA-34B | 67.4 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 5 | Tarsier (34B) | 66.4 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code |
| 6 | LinVT-Qwen2-VL (7B) | 66.2 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |
| 7 | TS-LLaVA-34B | 66.2 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 8 | PPLLaVA-7B | 64.3 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 9 | IG-VLM | 63.8 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 10 | ST-LLM | 63.2 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 11 | CAT-7B | 62.1 | No | CAT: Enhancing Multimodal Large Language Model t... | 2024-03-07 | Code |
| 12 | VideoGPT+ | 60.6 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 13 | Vista-LLaMA-7B | 60.5 | No | Vista-LLaMA: Reliable Video Narrator via Equal D... | 2023-12-12 | - |
| 14 | MiniGPT4-video-7B | 59.73 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Vi... | 2024-04-04 | Code |
| 15 | LLaVA-Mini | 59.5 | No | LLaVA-Mini: Efficient Image and Video Large Mult... | 2025-01-07 | Code |
| 16 | Video-LaVIT | 59.3 | No | Video-LaVIT: Unified Video-Language Pre-training... | 2024-02-05 | Code |
| 17 | Video-LLaVA-7B | 59.2 | Yes | Video-LLaVA: Learning United Visual Representati... | 2023-11-16 | Code |
| 18 | LLaMA-VID-13B (2 Token) | 58.9 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 19 | LLaMA-VID-7B (2 Token) | 57.7 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 20 | SUM-shot+Vicuna | 56.8 | No | Shot2Story20K: A New Benchmark for Comprehensive... | 2023-12-16 | Code |
| 21 | Omni-VideoAssistant | 55.3 | No | OmniDataComposer: A Unified Data Structure for M... | 2023-08-08 | Code |
| 22 | Chat-UniVi-7B | 55 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 23 | VideoChat2 | 54.1 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 24 | MovieChat | 52.7 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 25 | BT-Adapter (zero-shot) | 51.2 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 26 | BT-Adapter (zero-shot) | 51.2 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 27 | Mirasol3B | 50.42 | No | Mirasol3B: A Multimodal Autoregressive model for... | 2023-11-09 | - |
| 28 | VAST | 50.1 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 29 | Video-ChatGPT-7B | 49.3 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 30 | VALOR | 49.2 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 31 | COSA | 49.2 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 32 | MA-LMM | 48.5 | No | MA-LMM: Memory-Augmented Large Multimodal Model ... | 2024-04-08 | Code |
| 33 | mPLUG-2 | 48 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 34 | FrozenBiLM | 47 | Yes | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 35 | HBI | 46.2 | No | Video-Text as Game Players: Hierarchical Banzhaf... | 2023-03-25 | Code |
| 36 | EMCL-Net | 45.8 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 37 | Video Chat-7B | 45 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 38 | VindLU | 44.6 | Yes | VindLU: A Recipe for Effective Video-and-Languag... | 2022-12-09 | Code |
| 39 | VIOLETv2 | 44.5 | No | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 40 | Singularity-temporal | 43.9 | No | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 41 | LLaMA Adapter-7B | 43.8 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 42 | Singularity | 43.5 | No | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 43 | Video LLaMA-7B | 29.6 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |
| 44 | FrozenBiLM (0-shot) | 16.7 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |