Video Question Answering on MSRVTT-QA
Metric: Accuracy (higher is better)
Results
| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|-------|----------|------------|-------|------|------|
| 1 | Flash-VStream | 72.4 | No | Flash-VStream: Memory-Based Real-Time Understand... | 2024-06-12 | Code |
| 2 | PLLaVA (34B) | 68.7 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 3 | Elysium | 67.5 | No | Elysium: Exploring Object-level Perception in Vi... | 2024-03-25 | Code |
| 4 | SlowFast-LLaVA-34B | 67.4 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 5 | Tarsier (34B) | 66.4 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code |
| 6 | LinVT-Qwen2-VL (7B) | 66.2 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |
| 7 | TS-LLaVA-34B | 66.2 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 8 | PPLLaVA-7B | 64.3 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 9 | IG-VLM | 63.8 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 10 | ST-LLM | 63.2 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 11 | CAT-7B | 62.1 | No | CAT: Enhancing Multimodal Large Language Model t... | 2024-03-07 | Code |
| 12 | VideoGPT+ | 60.6 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 13 | Vista-LLaMA-7B | 60.5 | No | Vista-LLaMA: Reliable Video Narrator via Equal D... | 2023-12-12 | - |
| 14 | MiniGPT4-video-7B | 59.73 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Vi... | 2024-04-04 | Code |
| 15 | LLaVA-Mini | 59.5 | No | LLaVA-Mini: Efficient Image and Video Large Mult... | 2025-01-07 | Code |
| 16 | Video-LaVIT | 59.3 | No | Video-LaVIT: Unified Video-Language Pre-training... | 2024-02-05 | Code |
| 17 | Video-LLaVA-7B | 59.2 | Yes | Video-LLaVA: Learning United Visual Representati... | 2023-11-16 | Code |
| 18 | LLaMA-VID-13B (2 Token) | 58.9 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 19 | LLaMA-VID-7B (2 Token) | 57.7 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 20 | SUM-shot+Vicuna | 56.8 | No | Shot2Story20K: A New Benchmark for Comprehensive... | 2023-12-16 | Code |
| 21 | Omni-VideoAssistant | 55.3 | No | OmniDataComposer: A Unified Data Structure for M... | 2023-08-08 | Code |
| 22 | Chat-UniVi-7B | 55 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 23 | VideoChat2 | 54.1 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 24 | MovieChat | 52.7 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 25 | BT-Adapter (zero-shot) | 51.2 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 26 | BT-Adapter (zero-shot) | 51.2 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 27 | Mirasol3B | 50.42 | No | Mirasol3B: A Multimodal Autoregressive model for... | 2023-11-09 | - |
| 28 | VAST | 50.1 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 29 | Video-ChatGPT-7B | 49.3 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 30 | VALOR | 49.2 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 31 | COSA | 49.2 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 32 | MA-LMM | 48.5 | No | MA-LMM: Memory-Augmented Large Multimodal Model ... | 2024-04-08 | Code |
| 33 | mPLUG-2 | 48 | No | mPLUG-2: A Modularized Multi-modal Foundation Mo... | 2023-02-01 | Code |
| 34 | FrozenBiLM | 47 | Yes | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 35 | HBI | 46.2 | No | Video-Text as Game Players: Hierarchical Banzhaf... | 2023-03-25 | Code |
| 36 | EMCL-Net | 45.8 | No | Expectation-Maximization Contrastive Learning fo... | 2022-11-21 | Code |
| 37 | Video Chat-7B | 45 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 38 | VindLU | 44.6 | Yes | VindLU: A Recipe for Effective Video-and-Languag... | 2022-12-09 | Code |
| 39 | VIOLETv2 | 44.5 | No | An Empirical Study of End-to-End Video-Language ... | 2022-09-04 | Code |
| 40 | Singularity-temporal | 43.9 | No | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 41 | LLaMA Adapter-7B | 43.8 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 42 | Singularity | 43.5 | No | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 43 | Video LLaMA-7B | 29.6 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |
| 44 | FrozenBiLM (0-shot) | 16.7 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |