
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Khan

2024-06-13

Tasks: Zero-Shot Video Question Answer, VCGBench-Diverse, Question Answering, Video-based Generative Performance Benchmarking, Video-based Generative Performance Benchmarking (Contextual Understanding), Video-based Generative Performance Benchmarking (Correctness of Information), Video Question Answering, Video Captioning, Video-based Generative Performance Benchmarking (Consistency), Video-based Generative Performance Benchmarking (Temporal Understanding), Dense Video Captioning, Video-based Generative Performance Benchmarking (Detail Orientation), Video Understanding

Abstract

Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While the current video LMMs utilize advanced Large Language Models (LLMs), they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which can be important in videos with intricate action sequences. On the other hand, video encoders provide temporal context but are often limited by computational constraints that lead to processing only sparse frames at lower resolutions, resulting in reduced contextual and spatial understanding. To this end, we introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling). The model processes videos by dividing them into smaller segments and applies an adaptive pooling strategy on features extracted by both image and video encoders. Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and zero-shot question answering. Further, we develop a 112K video-instruction set using a novel semi-automatic annotation pipeline, which further improves the model performance. Additionally, to comprehensively evaluate video LMMs, we present VCGBench-Diverse, covering 18 broad video categories such as lifestyle, sports, science, gaming, and surveillance videos. This benchmark with 4,354 question-answer pairs evaluates the generalization of existing LMMs on dense video captioning, spatial and temporal understanding, and complex reasoning, ensuring comprehensive assessment across diverse video types and dynamics. Code: https://github.com/mbzuai-oryx/VideoGPT-plus.
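
The segment-then-pool idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not state the segment length, the pooled size, or how the two feature streams are merged, so `segment_len`, `out_len`, the 1D pooling over plain floats, and the concatenation in `fuse_segment` are all assumptions made for illustration. See the linked repository for the actual code.

```python
from typing import List, Sequence


def split_into_segments(num_frames: int, segment_len: int) -> List[List[int]]:
    """Split frame indices 0..num_frames-1 into consecutive segments.

    `segment_len` is a hypothetical parameter; the last segment may be shorter.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [
        list(range(start, min(start + segment_len, num_frames)))
        for start in range(0, num_frames, segment_len)
    ]


def adaptive_avg_pool_1d(values: Sequence[float], out_len: int) -> List[float]:
    """Average-pool `values` down (or up) to exactly `out_len` bins.

    Bin i covers indices [floor(i*n/out_len), ceil((i+1)*n/out_len)),
    so the output length is fixed regardless of the input length.
    """
    n = len(values)
    if n == 0 or out_len <= 0:
        raise ValueError("need a non-empty input and a positive out_len")
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = -((-(i + 1) * n) // out_len)  # ceiling division
        window = values[start:end]
        pooled.append(sum(window) / len(window))
    return pooled


def fuse_segment(
    image_feats: Sequence[float], video_feats: Sequence[float], out_len: int
) -> List[float]:
    """Pool both feature streams to `out_len` and concatenate them.

    Concatenation is an assumption; the abstract only says pooling is
    applied to features from both encoders.
    """
    return adaptive_avg_pool_1d(image_feats, out_len) + adaptive_avg_pool_1d(
        video_feats, out_len
    )
```

For example, `split_into_segments(10, 4)` yields three segments of 4, 4, and 2 frames, and `fuse_segment` returns a fixed-length vector for each segment whatever the number of input features.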

Results

Task | Dataset | Metric | Value | Model
Question Answering | MSVD-QA | Accuracy | 72.4 | VideoGPT+
Question Answering | MSVD-QA | Confidence Score | 3.6 | VideoGPT+
Question Answering | TGIF-QA | Accuracy | 74.6 | VideoGPT+
Question Answering | TGIF-QA | Confidence Score | 4.1 | VideoGPT+
Question Answering | MSRVTT-QA | Accuracy | 60.6 | VideoGPT+
Question Answering | MSRVTT-QA | Confidence Score | 3.6 | VideoGPT+
Question Answering | ActivityNet-QA | Accuracy | 50.6 | VideoGPT+
Question Answering | ActivityNet-QA | Confidence Score | 3.6 | VideoGPT+
Visual Question Answering (VQA) | VideoInstruct | Consistency | 3.39 | VideoGPT+
Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 3.74 | VideoGPT+
Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 3.27 | VideoGPT+
Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 3.18 | VideoGPT+
Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.83 | VideoGPT+
Visual Question Answering (VQA) | VideoInstruct | mean | 3.28 | VideoGPT+
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.74 | VideoGPT+
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.27 | VideoGPT+
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.18 | VideoGPT+
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.83 | VideoGPT+
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.39 | VideoGPT+
Video Question Answering | TVBench | Average Accuracy | 41.7 | VideoGPT+
Video Question Answering | MVBench | Avg. | 58.7 | VideoGPT+
Video Question Answering | MSVD-QA | Accuracy | 72.4 | VideoGPT+
Video Question Answering | MSVD-QA | Confidence Score | 3.6 | VideoGPT+
Video Question Answering | TGIF-QA | Accuracy | 74.6 | VideoGPT+
Video Question Answering | TGIF-QA | Confidence Score | 4.1 | VideoGPT+
Video Question Answering | MSRVTT-QA | Accuracy | 60.6 | VideoGPT+
Video Question Answering | MSRVTT-QA | Confidence Score | 3.6 | VideoGPT+
Video Question Answering | ActivityNet-QA | Accuracy | 50.6 | VideoGPT+
Video Question Answering | ActivityNet-QA | Confidence Score | 3.6 | VideoGPT+
Generative Visual Question Answering | VideoInstruct | Consistency | 3.39 | VideoGPT+
Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 3.74 | VideoGPT+
Generative Visual Question Answering | VideoInstruct | Correctness of Information | 3.27 | VideoGPT+
Generative Visual Question Answering | VideoInstruct | Detail Orientation | 3.18 | VideoGPT+
Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 2.83 | VideoGPT+
Generative Visual Question Answering | VideoInstruct | mean | 3.28 | VideoGPT+
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.74 | VideoGPT+
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.27 | VideoGPT+
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.18 | VideoGPT+
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.83 | VideoGPT+
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.39 | VideoGPT+
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 3.27 | VideoGPT+
Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 3.39 | VideoGPT+
Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.74 | VideoGPT+
Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 3.27 | VideoGPT+
Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 3.18 | VideoGPT+
Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.83 | VideoGPT+
Video-based Generative Performance Benchmarking | VideoInstruct | mean | 3.28 | VideoGPT+
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.74 | VideoGPT+
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.27 | VideoGPT+
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.18 | VideoGPT+
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.83 | VideoGPT+
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.39 | VideoGPT+
VCGBench-Diverse | VideoInstruct | Consistency | 2.59 | VideoGPT+
VCGBench-Diverse | VideoInstruct | Contextual Understanding | 2.81 | VideoGPT+
VCGBench-Diverse | VideoInstruct | Correctness of Information | 2.46 | VideoGPT+
VCGBench-Diverse | VideoInstruct | Dense Captioning | 1.38 | VideoGPT+
VCGBench-Diverse | VideoInstruct | Detail Orientation | 2.73 | VideoGPT+
VCGBench-Diverse | VideoInstruct | Reasoning | 3.63 | VideoGPT+
VCGBench-Diverse | VideoInstruct | Spatial Understanding | 2.8 | VideoGPT+
VCGBench-Diverse | VideoInstruct | Temporal Understanding | 1.78 | VideoGPT+
VCGBench-Diverse | VideoInstruct | mean | 2.47 | VideoGPT+
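
The `mean` rows above can be checked against the per-metric scores. The page does not say how the mean is computed; the sketch below assumes it is the arithmetic mean of the five core metrics (Consistency, Contextual Understanding, Correctness of Information, Detail Orientation, Temporal Understanding), rounded to two decimals. Under that assumption both reported means are reproduced, while averaging all eight VCGBench-Diverse scores gives a different value. `CORE_METRICS` and `core_mean` are names introduced here for illustration.

```python
from statistics import mean

# Assumed set of metrics entering the reported mean (not stated on the page).
CORE_METRICS = (
    "Consistency",
    "Contextual Understanding",
    "Correctness of Information",
    "Detail Orientation",
    "Temporal Understanding",
)

# Scores copied from the results listed above.
vcgbench = {
    "Consistency": 3.39,
    "Contextual Understanding": 3.74,
    "Correctness of Information": 3.27,
    "Detail Orientation": 3.18,
    "Temporal Understanding": 2.83,
}

vcgbench_diverse = {
    "Consistency": 2.59,
    "Contextual Understanding": 2.81,
    "Correctness of Information": 2.46,
    "Dense Captioning": 1.38,
    "Detail Orientation": 2.73,
    "Reasoning": 3.63,
    "Spatial Understanding": 2.8,
    "Temporal Understanding": 1.78,
}


def core_mean(scores: dict) -> float:
    """Mean over the five assumed core metrics, rounded to two decimals."""
    return round(mean(scores[m] for m in CORE_METRICS), 2)


print(core_mean(vcgbench))          # matches the reported 3.28
print(core_mean(vcgbench_diverse))  # matches the reported 2.47

# Averaging all eight VCGBench-Diverse scores does not match the reported mean.
all_eight = mean(vcgbench_diverse.values())
print(round(all_eight, 4))
```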
