Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao

2023-11-28 · CVPR 2024

Tasks: Fairness · Zero-Shot Video Question Answer · VCGBench-Diverse · Video-based Generative Performance Benchmarking · Video-based Generative Performance Benchmarking (Contextual Understanding) · Video-based Generative Performance Benchmarking (Correctness of Information) · Video Question Answering · Video-based Generative Performance Benchmarking (Consistency) · Video-based Generative Performance Benchmarking (Temporal Understanding) · 3D Question Answering (3D-QA) · Diagnostic · Video-based Generative Performance Benchmarking (Detail Orientation) · Video Understanding · Zero-Shot Learning · Multiple-choice

Paper · PDF · Code (official)

Abstract

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, via progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
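The abstract's key evaluation idea is to score models by exact multiple-choice match against ground-truth annotations rather than by an LLM judge. The sketch below illustrates that paradigm in miniature; the function names, the four-option format, and the example annotation are assumptions for illustration, not the paper's actual pipeline.

```python
import random

def make_multiple_choice(question, correct, distractors, seed=0):
    """Turn a ground-truth annotation into a multiple-choice QA item.

    `correct` is the ground-truth answer; `distractors` are wrong options
    (e.g. answers drawn from other annotations). Options are shuffled so
    the answer position carries no signal. (Hypothetical helper.)"""
    options = [correct] + list(distractors)
    rng = random.Random(seed)
    rng.shuffle(options)
    letters = "ABCD"[: len(options)]
    answer = letters[options.index(correct)]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": answer,
    }

def score(items, predictions):
    """Accuracy by exact option-letter match -- no LLM-based judging,
    so scoring is deterministic and unbiased by a judge model."""
    hits = sum(1 for it, p in zip(items, predictions) if p == it["answer"])
    return hits / len(items)

item = make_multiple_choice(
    "What does the person do after opening the fridge?",
    "takes out a bottle",
    ["closes the door", "sits down", "turns off the light"],
)
print(score([item], [item["answer"]]))  # 1.0
```

Because the answer key is fixed by the annotation, any model's output can be mapped to a letter and scored without subjective judgment, which is the fairness property the abstract emphasizes.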

Results

Task | Dataset | Metric | Value | Model
Zero-Shot Learning | TVQA | Accuracy | 40.6 | VideoChat2
Question Answering | NExT-QA | Accuracy | 61.7 | VideoChat2
Question Answering | STAR Benchmark | Accuracy | 59 | VideoChat2
Question Answering | MSVD-QA | Accuracy | 70 | VideoChat2
Question Answering | MSVD-QA | Confidence Score | 3.9 | VideoChat2
Question Answering | MSRVTT-QA | Accuracy | 54.1 | VideoChat2
Question Answering | MSRVTT-QA | Confidence Score | 3.3 | VideoChat2
Question Answering | TVQA | Accuracy | 50.6 | VideoChat_HD_mistral (no speech)
Question Answering | TVQA | Accuracy | 46.4 | VideoChat_mistral (no speech)
Question Answering | TVQA | Accuracy | 40.6 | VideoChat2 (no speech)
Question Answering | EgoSchema (fullset) | Accuracy | 56.7 | VideoChat2_phi3
Question Answering | EgoSchema (fullset) | Accuracy | 55.8 | VideoChat2_HD_mistral
Question Answering | EgoSchema (fullset) | Accuracy | 54.4 | VideoChat2_mistral
Question Answering | EgoSchema (subset) | Accuracy | 65.6 | VideoChat2_HD_mistral
Question Answering | EgoSchema (subset) | Accuracy | 63.6 | VideoChat2_mistral
Question Answering | ActivityNet-QA | Accuracy | 49.1 | VideoChat2
Question Answering | ActivityNet-QA | Confidence Score | 3.3 | VideoChat2
Visual Question Answering (VQA) | SQA3D | Exact Match | 37.3 | VideoChat2
Visual Question Answering (VQA) | ScanQA Test w/ objects | BLEU-4 | 9.6 | VideoChat2
Visual Question Answering (VQA) | ScanQA Test w/ objects | CIDEr | 49.2 | VideoChat2
Visual Question Answering (VQA) | ScanQA Test w/ objects | Exact Match | 19.2 | VideoChat2
Visual Question Answering (VQA) | ScanQA Test w/ objects | METEOR | 9.5 | VideoChat2
Visual Question Answering (VQA) | ScanQA Test w/ objects | ROUGE | 28.2 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.84 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 3.72 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 3.4 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.91 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.65 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | mean | 3.1 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.81 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 3.51 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 3.02 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.88 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.66 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | mean | 2.98 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.64 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.51 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.4 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.02 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.88 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.86 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.66 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.65 | VideoChat2_HD_mistral
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.81 | VideoChat2
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.62 | VideoChat2_HD_mistral
Video Question Answering | TVBench | Average Accuracy | 35 | VideoChat2
Video Question Answering | ActivityNet-QA | Accuracy | 49.1 | VideoChat2
Video Question Answering | ActivityNet-QA | Confidence Score | 3.3 | VideoChat2
Video Question Answering | NExT-QA | Accuracy | 79.5 | VideoChat2_HD_mistral
Video Question Answering | NExT-QA | Accuracy | 78.6 | VideoChat2_mistral
Video Question Answering | NExT-QA | Accuracy | 68.6 | VideoChat2
Video Question Answering | IntentQA | Accuracy | 83.4 | VideoChat2_HD_mistral
Video Question Answering | IntentQA | CH | 90 | VideoChat2_HD_mistral
Video Question Answering | IntentQA | CW | 84 | VideoChat2_HD_mistral
Video Question Answering | IntentQA | TP&TN | 77.3 | VideoChat2_HD_mistral
Video Question Answering | IntentQA | Accuracy | 81.9 | VideoChat2_mistral
Video Question Answering | IntentQA | CH | 86.9 | VideoChat2_mistral
Video Question Answering | IntentQA | CW | 82.6 | VideoChat2_mistral
Video Question Answering | IntentQA | TP&TN | 77 | VideoChat2_mistral
Video Question Answering | MVBench | Avg. | 51.9 | VideoChat2
Video Question Answering | NExT-QA | Accuracy | 61.7 | VideoChat2
Video Question Answering | STAR Benchmark | Accuracy | 59 | VideoChat2
Video Question Answering | MSVD-QA | Accuracy | 70 | VideoChat2
Video Question Answering | MSVD-QA | Confidence Score | 3.9 | VideoChat2
Video Question Answering | MSRVTT-QA | Accuracy | 54.1 | VideoChat2
Video Question Answering | MSRVTT-QA | Confidence Score | 3.3 | VideoChat2
Video Question Answering | TVQA | Accuracy | 50.6 | VideoChat_HD_mistral (no speech)
Video Question Answering | TVQA | Accuracy | 46.4 | VideoChat_mistral (no speech)
Video Question Answering | TVQA | Accuracy | 40.6 | VideoChat2 (no speech)
Video Question Answering | EgoSchema (fullset) | Accuracy | 56.7 | VideoChat2_phi3
Video Question Answering | EgoSchema (fullset) | Accuracy | 55.8 | VideoChat2_HD_mistral
Video Question Answering | EgoSchema (fullset) | Accuracy | 54.4 | VideoChat2_mistral
Video Question Answering | EgoSchema (subset) | Accuracy | 65.6 | VideoChat2_HD_mistral
Video Question Answering | EgoSchema (subset) | Accuracy | 63.6 | VideoChat2_mistral
Video Question Answering | ActivityNet-QA | Accuracy | 49.1 | VideoChat2
Video Question Answering | ActivityNet-QA | Confidence Score | 3.3 | VideoChat2
Generative Visual Question Answering | VideoInstruct | Consistency | 2.84 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 3.72 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | Correctness of Information | 3.4 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.91 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 2.65 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | mean | 3.1 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | Consistency | 2.81 | VideoChat2
Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 3.51 | VideoChat2
Generative Visual Question Answering | VideoInstruct | Correctness of Information | 3.02 | VideoChat2
Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.88 | VideoChat2
Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 2.66 | VideoChat2
Generative Visual Question Answering | VideoInstruct | mean | 2.98 | VideoChat2
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.64 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.51 | VideoChat2
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.4 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.02 | VideoChat2
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.88 | VideoChat2
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.86 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.66 | VideoChat2
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.65 | VideoChat2_HD_mistral
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.81 | VideoChat2
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.62 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 3.4 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 3.02 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.84 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.72 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.4 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.91 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.65 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | mean | 3.1 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.81 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.51 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.02 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.88 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.66 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | mean | 2.98 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.64 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.51 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.4 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.02 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.88 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.86 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.66 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.65 | VideoChat2_HD_mistral
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.81 | VideoChat2
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.62 | VideoChat2_HD_mistral
VCGBench-Diverse | VideoInstruct | Consistency | 2.27 | VideoChat2
VCGBench-Diverse | VideoInstruct | Contextual Understanding | 2.51 | VideoChat2
VCGBench-Diverse | VideoInstruct | Correctness of Information | 2.13 | VideoChat2
VCGBench-Diverse | VideoInstruct | Dense Captioning | 1.26 | VideoChat2
VCGBench-Diverse | VideoInstruct | Detail Orientation | 2.42 | VideoChat2
VCGBench-Diverse | VideoInstruct | Reasoning | 3.13 | VideoChat2
VCGBench-Diverse | VideoInstruct | Spatial Understanding | 2.43 | VideoChat2
VCGBench-Diverse | VideoInstruct | Temporal Understanding | 1.66 | VideoChat2
VCGBench-Diverse | VideoInstruct | mean | 2.2 | VideoChat2
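Assuming the results are stored as pipe-separated rows in the shape (Task | Dataset | Metric | Value | Model), a few lines of Python can extract the best-scoring model per dataset. The rows below are a small excerpt for illustration, not the full table.

```python
# Find the best Accuracy per dataset from pipe-separated result rows.
table = """\
Zero-Shot Learning | TVQA | Accuracy | 40.6 | VideoChat2
Video Question Answering | NExT-QA | Accuracy | 79.5 | VideoChat2_HD_mistral
Video Question Answering | NExT-QA | Accuracy | 68.6 | VideoChat2
Video Question Answering | EgoSchema (fullset) | Accuracy | 56.7 | VideoChat2_phi3"""

best = {}
for line in table.splitlines():
    task, dataset, metric, value, model = (f.strip() for f in line.split("|"))
    if metric == "Accuracy":
        v = float(value)
        # Keep only the highest value seen so far for this dataset.
        if dataset not in best or v > best[dataset][0]:
            best[dataset] = (v, model)

for dataset, (v, model) in sorted(best.items()):
    print(f"{dataset}: {v} ({model})")
```

Filtering on the metric name matters here: mixing Accuracy (a percentage) with Confidence Score or gpt-score (1-5 scales) would make any cross-row comparison meaningless.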

Related Papers

- A Reproducibility Study of Product-side Fairness in Bundle Recommendation (2025-07-18)
- Smart fault detection in satellite electrical power system (2025-07-18)
- FedGA: A Fair Federated Learning Framework Based on the Gini Coefficient (2025-07-17)
- Demographic-aware fine-grained classification of pediatric wrist fractures (2025-07-17)
- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models (2025-07-17)