Video Question Answering on TVBench

Metric: Average Accuracy (higher is better)

Results

#	Model	Average Accuracy	Extra Data	Paper	Date	Code
1	Seed1.5-VL thinking	63.6	No	Seed1.5-VL Technical Report	2025-05-11	-
2	PLM-8B	63.5	No	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	2025-04-17	Code
3	Seed1.5-VL	61.5	No	Seed1.5-VL Technical Report	2025-05-11	-
4	V-JEPA 2 ViT-g 8B	60.6	No	V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning	2025-06-11	Code
5	PLM-3B	58.9	No	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	2025-04-17	Code
6	RRPO	56.5	No	Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization	2025-04-16	-
7	Tarsier-34B	55.5	No	Tarsier: Recipes for Training and Evaluating Large Video Description Models	2024-06-30	Code
8	Tarsier2-7B	54.7	No	Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding	2025-01-14	Code
9	Qwen2-VL-72B	52.7	No	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	2024-09-18	Code
10	IXC-2.5 7B	51.6	No	InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	2024-07-03	Code
11	Aria	51.0	No	Aria: An Open Multimodal Native Mixture-of-Experts Model	2024-10-08	Code
12	PLM-1B	50.4	No	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	2025-04-17	Code
13	LLaVA-Video 72B	50.0	No	Video Instruction Tuning With Synthetic Data	2024-10-03	-
14	VideoLLaMA2 72B	48.4	No	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024-06-11	Code
15	Gemini 1.5 Pro	47.6	No	Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	2024-03-08	Code
16	Tarsier-7B	46.9	No	Tarsier: Recipes for Training and Evaluating Large Video Description Models	2024-06-30	Code
17	LLaVA-Video 7B	45.6	No	Video Instruction Tuning With Synthetic Data	2024-10-03	-
18	Qwen2-VL-7B	43.8	No	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	2024-09-18	Code
19	VideoLLaMA2 7B	42.9	No	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024-06-11	Code
20	PLLaVA-34B	42.3	No	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024-04-25	Code
21	mPLUG-Owl3	42.2	No	mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models	2024-08-09	Code
22	VideoLLaMA2.1	42.1	No	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024-06-11	Code
23	VideoGPT+	41.7	No	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	2024-06-13	Code
24	GPT4o 8 frames	39.9	No	GPT-4o System Card	2024-10-25	-
25	PLLaVA-13B	36.4	No	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024-04-25	Code
26	ST-LLM	35.7	No	ST-LLM: Large Language Models Are Effective Temporal Learners	2024-03-30	Code
27	VideoChat2	35.0	No	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
28	PLLaVA-7B	34.9	No	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024-04-25	Code

Entries flagged SOTA on the leaderboard: Seed1.5-VL thinking, PLM-8B, RRPO, Tarsier-34B, VideoLLaMA2 72B, Gemini 1.5 Pro, VideoChat2.