Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao

2023-11-28 · CVPR 2024

Tasks: Fairness · Zero-Shot Video Question Answer · VCGBench-Diverse · Video-based Generative Performance Benchmarking · Video-based Generative Performance Benchmarking (Contextual Understanding) · Video-based Generative Performance Benchmarking (Correctness of Information) · Video Question Answering · Video-based Generative Performance Benchmarking (Consistency) · Video-based Generative Performance Benchmarking (Temporal Understanding) · 3D Question Answering (3D-QA) · Diagnostic · Video-based Generative Performance Benchmarking (Detail Orientation) · Video Understanding · Zero-Shot Learning · Multiple-choice

Paper · PDF · Code (official)

Abstract

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, via progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
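The abstract's key evaluation idea is to score models by exact multiple-choice match against ground-truth annotations rather than by an LLM judge. The sketch below illustrates that paradigm in miniature; the function names, the four-option format, and the example annotation are assumptions for illustration, not the paper's actual pipeline.

```python
import random

def make_multiple_choice(question, correct, distractors, seed=0):
    """Turn a ground-truth annotation into a multiple-choice QA item.

    `correct` is the ground-truth answer; `distractors` are wrong options
    (e.g. answers drawn from other annotations). Options are shuffled so
    the answer position carries no signal. (Hypothetical helper.)"""
    options = [correct] + list(distractors)
    rng = random.Random(seed)
    rng.shuffle(options)
    letters = "ABCD"[: len(options)]
    answer = letters[options.index(correct)]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": answer,
    }

def score(items, predictions):
    """Accuracy by exact option-letter match -- no LLM-based judging,
    so scoring is deterministic and unbiased by a judge model."""
    hits = sum(1 for it, p in zip(items, predictions) if p == it["answer"])
    return hits / len(items)

item = make_multiple_choice(
    "What does the person do after opening the fridge?",
    "takes out a bottle",
    ["closes the door", "sits down", "turns off the light"],
)
print(score([item], [item["answer"]]))  # 1.0
```

Because the answer key is fixed by the annotation, any model's output can be mapped to a letter and scored without subjective judgment, which is the fairness property the abstract emphasizes.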

Results

Task | Dataset | Metric | Value | Model
Zero-Shot Learning | TVQA | Accuracy | 40.6 | VideoChat2
Question Answering | NExT-QA | Accuracy | 61.7 | VideoChat2
Question Answering | STAR Benchmark | Accuracy | 59 | VideoChat2
Question Answering | MSVD-QA | Accuracy | 70 | VideoChat2
Question Answering | MSVD-QA | Confidence Score | 3.9 | VideoChat2
Question Answering | MSRVTT-QA | Accuracy | 54.1 | VideoChat2
Question Answering | MSRVTT-QA | Confidence Score | 3.3 | VideoChat2
Question Answering | TVQA | Accuracy | 50.6 | VideoChat_HD_mistral (no speech)
Question Answering | TVQA | Accuracy | 46.4 | VideoChat_mistral (no speech)
Question Answering | TVQA | Accuracy | 40.6 | VideoChat2 (no speech)
Question Answering | EgoSchema (fullset) | Accuracy | 56.7 | VideoChat2_phi3
Question Answering | EgoSchema (fullset) | Accuracy | 55.8 | VideoChat2_HD_mistral
Question Answering | EgoSchema (fullset) | Accuracy | 54.4 | VideoChat2_mistral
Question Answering | EgoSchema (subset) | Accuracy | 65.6 | VideoChat2_HD_mistral
Question Answering | EgoSchema (subset) | Accuracy | 63.6 | VideoChat2_mistral
Question Answering | ActivityNet-QA | Accuracy | 49.1 | VideoChat2
Question Answering | ActivityNet-QA | Confidence Score | 3.3 | VideoChat2
Visual Question Answering (VQA) | SQA3D | Exact Match | 37.3 | VideoChat2
Visual Question Answering (VQA) | ScanQA Test w/ objects | BLEU-4 | 9.6 | VideoChat2
Visual Question Answering (VQA) | ScanQA Test w/ objects | CIDEr | 49.2 | VideoChat2
Visual Question Answering (VQA) | ScanQA Test w/ objects | Exact Match | 19.2 | VideoChat2
Visual Question Answering (VQA) | ScanQA Test w/ objects | METEOR | 9.5 | VideoChat2
Visual Question Answering (VQA) | ScanQA Test w/ objects | ROUGE | 28.2 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.84 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 3.72 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 3.4 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.91 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.65 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | mean | 3.1 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.81 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 3.51 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 3.02 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.88 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.66 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | mean | 2.98 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.64 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.51 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.4 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.02 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.88 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.86 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.66 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.65 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.81 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.62 | VideoChat2_HD_mistral
Video Question Answering | TVBench | Average Accuracy | 35 | VideoChat2
Video Question Answering | ActivityNet-QA | Accuracy | 49.1 | VideoChat2
Video Question Answering | ActivityNet-QA | Confidence Score | 3.3 | VideoChat2
Video Question Answering | NExT-QA | Accuracy | 79.5 | VideoChat2_HD_mistral
Video Question Answering | NExT-QA | Accuracy | 78.6 | VideoChat2_mistral
Video Question Answering | NExT-QA | Accuracy | 68.6 | VideoChat2
Video Question Answering | IntentQA | Accuracy | 83.4 | VideoChat2_HD_mistral
Video Question Answering | IntentQA | CH | 90 | VideoChat2_HD_mistral
Video Question Answering | IntentQA | CW | 84 | VideoChat2_HD_mistral
Video Question Answering | IntentQA | TP&TN | 77.3 | VideoChat2_HD_mistral
Video Question Answering | IntentQA | Accuracy | 81.9 | VideoChat2_mistral
Video Question Answering | IntentQA | CH | 86.9 | VideoChat2_mistral
Video Question Answering | IntentQA | CW | 82.6 | VideoChat2_mistral
Video Question Answering | IntentQA | TP&TN | 77 | VideoChat2_mistral
Video Question Answering | MVBench | Avg. | 51.9 | VideoChat2
Video Question Answering | NExT-QA | Accuracy | 61.7 | VideoChat2
Video Question Answering | STAR Benchmark | Accuracy | 59 | VideoChat2
Video Question Answering | MSVD-QA | Accuracy | 70 | VideoChat2
Video Question Answering | MSVD-QA | Confidence Score | 3.9 | VideoChat2
Video Question Answering | MSRVTT-QA | Accuracy | 54.1 | VideoChat2
Video Question Answering | MSRVTT-QA | Confidence Score | 3.3 | VideoChat2
Video Question Answering | TVQA | Accuracy | 50.6 | VideoChat_HD_mistral (no speech)
Video Question Answering | TVQA | Accuracy | 46.4 | VideoChat_mistral (no speech)
Video Question Answering | TVQA | Accuracy | 40.6 | VideoChat2 (no speech)
Video Question Answering | EgoSchema (fullset) | Accuracy | 56.7 | VideoChat2_phi3
Video Question Answering | EgoSchema (fullset) | Accuracy | 55.8 | VideoChat2_HD_mistral
Video Question Answering | EgoSchema (fullset) | Accuracy | 54.4 | VideoChat2_mistral
Video Question Answering | EgoSchema (subset) | Accuracy | 65.6 | VideoChat2_HD_mistral
Video Question Answering | EgoSchema (subset) | Accuracy | 63.6 | VideoChat2_mistral
Video Question Answering | ActivityNet-QA | Accuracy | 49.1 | VideoChat2
Video Question Answering | ActivityNet-QA | Confidence Score | 3.3 | VideoChat2
Generative Visual Question Answering | VideoInstruct | Consistency | 2.84 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 3.72 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | Correctness of Information | 3.4 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.91 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 2.65 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | mean | 3.1 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | Consistency | 2.81 | VideoChat2
Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 3.51 | VideoChat2
Generative Visual Question Answering | VideoInstruct | Correctness of Information | 3.02 | VideoChat2
Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.88 | VideoChat2
Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 2.66 | VideoChat2
Generative Visual Question Answering | VideoInstruct | mean | 2.98 | VideoChat2
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.64 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.51 | VideoChat2
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.4 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.02 | VideoChat2
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.88 | VideoChat2
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.86 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.66 | VideoChat2
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.65 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.81 | VideoChat2
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.62 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 3.4 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 3.02 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.84 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.72 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.4 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.91 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.65 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | mean | 3.1 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.81 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.51 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.02 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.88 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.66 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | mean | 2.98 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.64 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.51 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.4 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.02 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.88 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.86 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.66 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.65 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.81 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.62 | VideoChat2_HD_mistral
VCGBench-Diverse | VideoInstruct | Consistency | 2.27 | VideoChat2
VCGBench-Diverse | VideoInstruct | Contextual Understanding | 2.51 | VideoChat2
VCGBench-Diverse | VideoInstruct | Correctness of Information | 2.13 | VideoChat2
VCGBench-Diverse | VideoInstruct | Dense Captioning | 1.26 | VideoChat2
VCGBench-Diverse | VideoInstruct | Detail Orientation | 2.42 | VideoChat2
VCGBench-Diverse | VideoInstruct | Reasoning | 3.13 | VideoChat2
VCGBench-Diverse | VideoInstruct | Spatial Understanding | 2.43 | VideoChat2
VCGBench-Diverse | VideoInstruct | Temporal Understanding | 1.66 | VideoChat2
VCGBench-Diverse | VideoInstruct | mean | 2.2 | VideoChat2
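Assuming the results are stored as pipe-separated rows in the shape (Task | Dataset | Metric | Value | Model), a few lines of Python can extract the best-scoring model per dataset. The rows below are a small excerpt for illustration, not the full table.

```python
# Find the best Accuracy per dataset from pipe-separated result rows.
table = """\
Zero-Shot Learning | TVQA | Accuracy | 40.6 | VideoChat2
Video Question Answering | NExT-QA | Accuracy | 79.5 | VideoChat2_HD_mistral
Video Question Answering | NExT-QA | Accuracy | 68.6 | VideoChat2
Video Question Answering | EgoSchema (fullset) | Accuracy | 56.7 | VideoChat2_phi3"""

best = {}
for line in table.splitlines():
    task, dataset, metric, value, model = (f.strip() for f in line.split("|"))
    if metric == "Accuracy":
        v = float(value)
        # Keep only the highest value seen so far for this dataset.
        if dataset not in best or v > best[dataset][0]:
            best[dataset] = (v, model)

for dataset, (v, model) in sorted(best.items()):
    print(f"{dataset}: {v} ({model})")
```

Filtering on the metric name matters here: mixing Accuracy (a percentage) with Confidence Score or gpt-score (1-5 scales) would make any cross-row comparison meaningless.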

Related Papers

- A Reproducibility Study of Product-side Fairness in Bundle Recommendation (2025-07-18)
- Smart fault detection in satellite electrical power system (2025-07-18)
- FedGA: A Fair Federated Learning Framework Based on the Gini Coefficient (2025-07-17)
- Demographic-aware fine-grained classification of pediatric wrist fractures (2025-07-17)
- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models (2025-07-17)