Video Question Answering on MVBench

Metric: Avg. (higher is better)
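
As a rough illustration of the metric, the sketch below assumes that Avg. is the unweighted mean of per-task accuracies, in percent, over the benchmark's tasks. That assumption is not stated on this page. The function name `mvbench_avg`, the task names, and the scores are hypothetical placeholders, not results for any model listed here.

```python
from statistics import mean


def mvbench_avg(task_accuracy: dict[str, float]) -> float:
    """Unweighted mean of per-task accuracies (percent), rounded to one decimal.

    Assumption: every task contributes equally to the reported average.
    """
    if not task_accuracy:
        raise ValueError("no task scores given")
    return round(mean(task_accuracy.values()), 1)


# Hypothetical per-task accuracies, for illustration only.
example_scores = {"task_a": 70.0, "task_b": 55.5, "task_c": 62.0}
print(mvbench_avg(example_scores))
```

Under this reading a higher value means more questions answered correctly on average across tasks, which matches the "higher is better" ordering of the table.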

Results

#	Model	Avg.	Extra Data	Paper	Date	Code
1	LinVT-Qwen2-VL (7B)	69.3	No	LinVT: Empower Your Image-level Large Language Model to Understand Videos	2024-12-06	Code
2	Tarsier (34B)	67.6	No	Tarsier: Recipes for Training and Evaluating Large Video Description Models	2024-06-30	Code
3	InternVideo2	67.2	No	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024-03-22	Code
4	LongVU (7B)	66.9	No	LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding	2024-10-22	Code
5	Oryx (34B)	64.7	No	Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution	2024-09-19	Code
6	VideoLLaMA2 (72B)	62	No	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024-06-11	Code
7	VideoChat-T (7B)	59.9	No	TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	2024-10-25	Code
8	mPLUG-Owl3 (7B)	59.5	No	mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models	2024-08-09	Code
9	PPLLaVA (7B)	59.2	No	PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	2024-11-04	Code
10	VideoGPT+	58.7	No	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	2024-06-13	Code
11	PLLaVA	58.1	No	PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024-04-25	Code
12	ST-LLM	54.9	No	ST-LLM: Large Language Models Are Effective Temporal Learners	2024-03-30	Code
13	VideoChat2	51.9	No	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
14	HawkEye	47.55	No	HawkEye: Training Video-Text LLMs for Grounding Text in Videos	2024-03-15	Code
15	SPHINX-Plus	39.7	No	SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models	2024-02-08	Code
16	TimeChat	38.5	No	TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding	2023-12-04	Code
17	LLaVA	36	No	Visual Instruction Tuning	2023-04-17	Code
18	VideoChat	35.5	No	VideoChat: Chat-Centric Video Understanding	2023-05-10	Code
19	VideoLLaMA	34.1	No	Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	2023-06-05	Code
20	Video-ChatGPT	32.7	No	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	2023-06-08	Code
21	InstructBLIP	32.5	No	InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	2023-05-11	Code
22	MiniGPT4	18.8	No	MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	2023-04-20	Code

Entries flagged SOTA on the leaderboard: LinVT-Qwen2-VL (7B), Tarsier (34B), InternVideo2, VideoChat2, LLaVA.