Video Question Answering on TVBench
Metric: Average Accuracy (higher is better)
Results
Entries flagged SOTA had the highest Average Accuracy on this leaderboard as of their listed date.

| # | Model | Average Accuracy | SOTA | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|---|
| 1 | Seed1.5-VL thinking | 63.6 | SOTA | No | Seed1.5-VL Technical Report | 2025-05-11 | - |
| 2 | PLM-8B | 63.5 | SOTA | No | PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | 2025-04-17 | Code |
| 3 | Seed1.5-VL | 61.5 | | No | Seed1.5-VL Technical Report | 2025-05-11 | - |
| 4 | V-JEPA 2 ViT-g 8B | 60.6 | | No | V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | 2025-06-11 | Code |
| 5 | PLM-3B | 58.9 | | No | PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | 2025-04-17 | Code |
| 6 | RRPO | 56.5 | SOTA | No | Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization | 2025-04-16 | - |
| 7 | Tarsier-34B | 55.5 | SOTA | No | Tarsier: Recipes for Training and Evaluating Large Video Description Models | 2024-06-30 | Code |
| 8 | Tarsier2-7B | 54.7 | | No | Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding | 2025-01-14 | Code |
| 9 | Qwen2-VL-72B | 52.7 | | No | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | 2024-09-18 | Code |
| 10 | IXC-2.5 7B | 51.6 | | No | InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | 2024-07-03 | Code |
| 11 | Aria | 51 | | No | Aria: An Open Multimodal Native Mixture-of-Experts Model | 2024-10-08 | Code |
| 12 | PLM-1B | 50.4 | | No | PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | 2025-04-17 | Code |
| 13 | LLaVA-Video 72B | 50 | | No | Video Instruction Tuning With Synthetic Data | 2024-10-03 | - |
| 14 | VideoLLaMA2 72B | 48.4 | SOTA | No | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | 2024-06-11 | Code |
| 15 | Gemini 1.5 Pro | 47.6 | SOTA | No | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | 2024-03-08 | Code |
| 16 | Tarsier-7B | 46.9 | | No | Tarsier: Recipes for Training and Evaluating Large Video Description Models | 2024-06-30 | Code |
| 17 | LLaVA-Video 7B | 45.6 | | No | Video Instruction Tuning With Synthetic Data | 2024-10-03 | - |
| 18 | Qwen2-VL-7B | 43.8 | | No | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | 2024-09-18 | Code |
| 19 | VideoLLaMA2 7B | 42.9 | | No | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | 2024-06-11 | Code |
| 20 | PLLaVA-34B | 42.3 | | No | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | 2024-04-25 | Code |
| 21 | mPLUG-Owl3 | 42.2 | | No | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | 2024-08-09 | Code |
| 22 | VideoLLaMA2.1 | 42.1 | | No | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | 2024-06-11 | Code |
| 23 | VideoGPT+ | 41.7 | | No | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | 2024-06-13 | Code |
| 24 | GPT4o 8 frames | 39.9 | | No | GPT-4o System Card | 2024-10-25 | - |
| 25 | PLLaVA-13B | 36.4 | | No | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | 2024-04-25 | Code |
| 26 | ST-LLM | 35.7 | | No | ST-LLM: Large Language Models Are Effective Temporal Learners | 2024-03-30 | Code |
| 27 | VideoChat2 | 35 | SOTA | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code |
| 28 | PLLaVA-7B | 34.9 | | No | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | 2024-04-25 | Code |