Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Question Answering on MSRVTT-QA

Metric: Confidence Score (lower is better)

Results

| # | Model | Confidence Score | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Video LLaMA-7B | 1.8 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |
| 2 | Video Chat-7B | 2.5 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 3 | MovieChat | 2.6 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 4 | LLaMA Adapter-7B | 2.7 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 5 | Video-ChatGPT-7B | 2.8 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 6 | BT-Adapter (zero-shot) | 2.9 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 7 | BT-Adapter (zero-shot) | 2.9 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 8 | Chat-UniVi-7B | 3.1 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 9 | Elysium | 3.2 | No | Elysium: Exploring Object-level Perception in Vi... | 2024-03-25 | Code |
| 10 | LLaMA-VID-7B (2 Token) | 3.2 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 11 | Vista-LLaMA-7B | 3.3 | No | Vista-LLaMA: Reliable Video Narrator via Equal D... | 2023-12-12 | - |
| 12 | Video-LaVIT | 3.3 | No | Video-LaVIT: Unified Video-Language Pre-training... | 2024-02-05 | Code |
| 13 | LLaMA-VID-13B (2 Token) | 3.3 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 14 | Omni-VideoAssistant | 3.3 | No | OmniDataComposer: A Unified Data Structure for M... | 2023-08-08 | Code |
| 15 | VideoChat2 | 3.3 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 16 | Flash-VStream | 3.4 | No | Flash-VStream: Memory-Based Real-Time Understand... | 2024-06-12 | Code |
| 17 | ST-LLM | 3.4 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 18 | PPLLaVA-7B | 3.5 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 19 | IG-VLM | 3.5 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 20 | CAT-7B | 3.5 | No | CAT: Enhancing Multimodal Large Language Model t... | 2024-03-07 | Code |
| 21 | Video-LLaVA-7B | 3.5 | Yes | Video-LLaVA: Learning United Visual Representati... | 2023-11-16 | Code |
| 22 | PLLaVA (34B) | 3.6 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 23 | TS-LLaVA-34B | 3.6 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 24 | VideoGPT+ | 3.6 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 25 | LLaVA-Mini | 3.6 | No | LLaVA-Mini: Efficient Image and Video Large Mult... | 2025-01-07 | Code |
| 26 | SlowFast-LLaVA-34B | 3.7 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 27 | Tarsier (34B) | 3.7 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code |
| 28 | LinVT-Qwen2-VL (7B) | 4 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |