Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Question Answering on MSVD-QA

Metric: Accuracy (higher is better)

Results

| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|-------|----------|------------|-------|------|------|
| 1 | Tarsier (34B) | 80.3 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code |
| 2 | Flash-VStream | 80.3 | No | Flash-VStream: Memory-Based Real-Time Understand... | 2024-06-12 | Code |
| 3 | LinVT-Qwen2-VL (7B) | 80.2 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |
| 4 | VILA1.5-40B | 80.1 | No | VILA: On Pre-training for Visual Language Models | 2023-12-12 | Code |
| 5 | PLLaVA (34B) | 79.9 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 6 | SlowFast-LLaVA-34B | 79.9 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 7 | IG-VLM-34B | 79.6 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 8 | TS-LLaVA-34B | 79.4 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 9 | PPLLaVA-7B | 77.1 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 10 | Elysium | 75.8 | No | Elysium: Exploring Object-level Perception in Vi... | 2024-03-25 | Code |
| 11 | MovieChat | 75.2 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 12 | ST-LLM | 74.6 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 13 | MiniGPT4-video-7B | 73.92 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Vi... | 2024-04-04 | Code |
| 14 | Video-LaVIT | 73.2 | No | Video-LaVIT: Unified Video-Language Pre-training... | 2024-02-05 | Code |
| 15 | VideoGPT+ | 72.4 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 16 | LLaVA-Mini | 70.9 | No | LLaVA-Mini: Efficient Image and Video Large Mult... | 2025-01-07 | Code |
| 17 | Video-LLaVA-7B | 70.7 | No | Video-LLaVA: Learning United Visual Representati... | 2023-11-16 | Code |
| 18 | VideoChat2 | 70 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 19 | LLaMA-VID-13B (2 Token) | 70 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 20 | LLaMA-VID-7B (2 Token) | 69.7 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 21 | Chat-UniVi-7B | 69.3 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 22 | BT-Adapter (zero-shot) | 67 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 23 | BT-Adapter (zero-shot) | 67 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 24 | Video-ChatGPT-7B | 64.9 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 25 | Video Chat-7B | 56.3 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 26 | LLaMA Adapter-7B | 54.9 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 27 | Video LLaMA-7B | 51.6 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |
| 28 | FrozenBiLM | 33.8 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |