Question Answering on MSVD-QA
Metric: Accuracy (higher is better)
Results
| # | Model | Accuracy | Extra Data | Paper | Date | Code | SOTA |
|---|-------|----------|------------|-------|------|------|------|
| 1 | Tarsier (34B) | 80.3 | No | Tarsier: Recipes for Training and Evaluating Large Video Description Models | 2024-06-30 | Code | |
| 2 | Flash-VStream | 80.3 | No | Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | 2024-06-12 | Code | SOTA |
| 3 | LinVT-Qwen2-VL (7B) | 80.2 | No | LinVT: Empower Your Image-level Large Language Model to Understand Videos | 2024-12-06 | Code | |
| 4 | VILA1.5-40B | 80.1 | No | VILA: On Pre-training for Visual Language Models | 2023-12-12 | Code | SOTA |
| 5 | PLLaVA (34B) | 79.9 | No | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | 2024-04-25 | Code | |
| 6 | SlowFast-LLaVA-34B | 79.9 | No | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | 2024-07-22 | Code | |
| 7 | IG-VLM-34B | 79.6 | No | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | 2024-03-27 | Code | |
| 8 | TS-LLaVA-34B | 79.4 | No | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | 2024-11-17 | Code | |
| 9 | PPLLaVA-7B | 77.1 | No | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | 2024-11-04 | Code | |
| 10 | Elysium | 75.8 | No | Elysium: Exploring Object-level Perception in Videos via MLLM | 2024-03-25 | Code | |
| 11 | MovieChat | 75.2 | No | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | 2023-07-31 | Code | SOTA |
| 12 | ST-LLM | 74.6 | No | ST-LLM: Large Language Models Are Effective Temporal Learners | 2024-03-30 | Code | |
| 13 | MiniGPT4-video-7B | 73.92 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | 2024-04-04 | Code | |
| 14 | Video-LaVIT | 73.2 | No | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | 2024-02-05 | Code | |
| 15 | VideoGPT+ | 72.4 | No | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | 2024-06-13 | Code | |
| 16 | LLaVA-Mini | 70.9 | No | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token | 2025-01-07 | Code | |
| 17 | Video-LLaVA-7B | 70.7 | No | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | 2023-11-16 | Code | |
| 18 | VideoChat2 | 70 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code | |
| 19 | LLaMA-VID-13B (2 Token) | 70 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | 2023-11-28 | Code | |
| 20 | LLaMA-VID-7B (2 Token) | 69.7 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | 2023-11-28 | Code | |
| 21 | Chat-UniVi-7B | 69.3 | No | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | 2023-11-14 | Code | |
| 22 | BT-Adapter (zero-shot) | 67 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | Code | |
| 23 | BT-Adapter (zero-shot) | 67 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | Code | |
| 24 | Video-ChatGPT-7B | 64.9 | No | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023-06-08 | Code | SOTA |
| 25 | Video Chat-7B | 56.3 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code | SOTA |
| 26 | LLaMA Adapter-7B | 54.9 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | 2023-04-28 | Code | SOTA |
| 27 | Video LLaMA-7B | 51.6 | No | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | 2023-06-05 | Code | |
| 28 | FrozenBiLM | 33.8 | No | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | 2022-06-16 | Code | SOTA |