Video Question Answering on MVBench
Metric: Avg. (higher is better)
Results
| # | Model | Avg. | Extra Data | SOTA | Paper | Date | Code |
|---|-------|------|------------|------|-------|------|------|
| 1 | LinVT-Qwen2-VL (7B) | 69.3 | No | SOTA | LinVT: Empower Your Image-level Large Language Model to Understand Videos | 2024-12-06 | Yes |
| 2 | Tarsier (34B) | 67.6 | No | SOTA | Tarsier: Recipes for Training and Evaluating Large Video Description Models | 2024-06-30 | Yes |
| 3 | InternVideo2 | 67.2 | No | SOTA | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | 2024-03-22 | Yes |
| 4 | LongVU (7B) | 66.9 | No | | LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | 2024-10-22 | Yes |
| 5 | Oryx (34B) | 64.7 | No | | Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | 2024-09-19 | Yes |
| 6 | VideoLLaMA2 (72B) | 62 | No | | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | 2024-06-11 | Yes |
| 7 | VideoChat-T (7B) | 59.9 | No | | TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | 2024-10-25 | Yes |
| 8 | mPLUG-Owl3 (7B) | 59.5 | No | | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | 2024-08-09 | Yes |
| 9 | PPLLaVA (7B) | 59.2 | No | | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | 2024-11-04 | Yes |
| 10 | VideoGPT+ | 58.7 | No | | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | 2024-06-13 | Yes |
| 11 | PLLaVA | 58.1 | No | | PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | 2024-04-25 | Yes |
| 12 | ST-LLM | 54.9 | No | | ST-LLM: Large Language Models Are Effective Temporal Learners | 2024-03-30 | Yes |
| 13 | VideoChat2 | 51.9 | No | SOTA | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Yes |
| 14 | HawkEye | 47.55 | No | | HawkEye: Training Video-Text LLMs for Grounding Text in Videos | 2024-03-15 | Yes |
| 15 | SPHINX-Plus | 39.7 | No | | SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | 2024-02-08 | Yes |
| 16 | TimeChat | 38.5 | No | | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | 2023-12-04 | Yes |
| 17 | LLaVA | 36 | No | SOTA | Visual Instruction Tuning | 2023-04-17 | Yes |
| 18 | VideoChat | 35.5 | No | | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Yes |
| 19 | VideoLLaMA | 34.1 | No | | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | 2023-06-05 | Yes |
| 20 | Video-ChatGPT | 32.7 | No | | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023-06-08 | Yes |
| 21 | InstructBLIP | 32.5 | No | | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | 2023-05-11 | Yes |
| 22 | MiniGPT4 | 18.8 | No | | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | 2023-04-20 | Yes |