Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan

2023-06-08

Tasks: Zero-Shot Video Question Answer, VCGBench-Diverse, Question Answering, Video-based Generative Performance Benchmarking, Video-based Generative Performance Benchmarking (Contextual Understanding), Video-based Generative Performance Benchmarking (Correctness of Information), Video Question Answering, Video-based Generative Performance Benchmarking (Consistency), Video-based Generative Performance Benchmarking (Temporal Understanding), Video-based Generative Performance Benchmarking (Detail Orientation), Video Understanding

Abstract

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the under-explored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The resulting model is capable of understanding and generating detailed conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT, acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyze their strengths and weaknesses. Code: https://github.com/mbzuai-oryx/Video-ChatGPT.

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Question Answering | NExT-QA (Open-ended VideoQA) | Accuracy | 54.6 | Video-ChatGPT |
| Question Answering | NExT-QA (Open-ended VideoQA) | Confidence Score | 3.2 | Video-ChatGPT |
| Question Answering | VNBench | Accuracy | 4.1 | VideoChatGPT |
| Question Answering | MSVD-QA | Accuracy | 64.9 | Video-ChatGPT-7B |
| Question Answering | MSVD-QA | Confidence Score | 3.3 | Video-ChatGPT-7B |
| Question Answering | TGIF-QA | Accuracy | 51.4 | Video-ChatGPT-7B |
| Question Answering | TGIF-QA | Confidence Score | 3 | Video-ChatGPT-7B |
| Question Answering | MSRVTT-QA | Accuracy | 49.3 | Video-ChatGPT-7B |
| Question Answering | MSRVTT-QA | Confidence Score | 2.8 | Video-ChatGPT-7B |
| Question Answering | ActivityNet-QA | Accuracy | 35.2 | Video-ChatGPT |
| Question Answering | ActivityNet-QA | Confidence Score | 2.7 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.37 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 2.62 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.4 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.52 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 1.98 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | mean | 2.38 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.62 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.4 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.52 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 1.98 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.37 | Video-ChatGPT |
| Video Question Answering | ActivityNet-QA | Accuracy | 35.2 | Video-ChatGPT |
| Video Question Answering | ActivityNet-QA | Confidence Score | 2.7 | Video-ChatGPT |
| Video Question Answering | MVBench | Avg. | 32.7 | Video-ChatGPT |
| Video Question Answering | VNBench | Accuracy | 4.1 | VideoChatGPT |
| Video Question Answering | MSVD-QA | Accuracy | 64.9 | Video-ChatGPT-7B |
| Video Question Answering | MSVD-QA | Confidence Score | 3.3 | Video-ChatGPT-7B |
| Video Question Answering | TGIF-QA | Accuracy | 51.4 | Video-ChatGPT-7B |
| Video Question Answering | TGIF-QA | Confidence Score | 3 | Video-ChatGPT-7B |
| Video Question Answering | MSRVTT-QA | Accuracy | 49.3 | Video-ChatGPT-7B |
| Video Question Answering | MSRVTT-QA | Confidence Score | 2.8 | Video-ChatGPT-7B |
| Generative Visual Question Answering | VideoInstruct | Consistency | 2.37 | Video-ChatGPT |
| Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 2.62 | Video-ChatGPT |
| Generative Visual Question Answering | VideoInstruct | Correctness of Information | 2.4 | Video-ChatGPT |
| Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.52 | Video-ChatGPT |
| Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 1.98 | Video-ChatGPT |
| Generative Visual Question Answering | VideoInstruct | mean | 2.38 | Video-ChatGPT |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.62 | Video-ChatGPT |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.4 | Video-ChatGPT |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.52 | Video-ChatGPT |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 1.98 | Video-ChatGPT |
| Generative Visual Question Answering | VideoInstruct | gpt-score | 2.37 | Video-ChatGPT |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 2.4 | Video-ChatGPT |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.37 | Video-ChatGPT |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 2.62 | Video-ChatGPT |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 2.4 | Video-ChatGPT |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.52 | Video-ChatGPT |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 1.98 | Video-ChatGPT |
| Video-based Generative Performance Benchmarking | VideoInstruct | mean | 2.38 | Video-ChatGPT |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.62 | Video-ChatGPT |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.4 | Video-ChatGPT |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.52 | Video-ChatGPT |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 1.98 | Video-ChatGPT |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.37 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Consistency | 2.06 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Contextual Understanding | 2.46 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Correctness of Information | 2.07 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Dense Captioning | 0.89 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Detail Orientation | 2.42 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Reasoning | 3.6 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Spatial Understanding | 2.25 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Temporal Understanding | 1.39 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | mean | 2.08 | Video-ChatGPT |
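
The "mean" value of 2.38 reported for Video-based Generative Performance Benchmarking on VideoInstruct appears to be the unweighted average of the five per-aspect scores listed for that task. The page does not state how the mean is computed, so this is an assumption; the short sketch below only checks that the arithmetic is consistent. The function name `average_score` is hypothetical and introduced purely for illustration.

```python
from statistics import mean

# Per-aspect scores for Video-ChatGPT on VideoInstruct,
# copied from the results listed above.
scores = {
    "Consistency": 2.37,
    "Contextual Understanding": 2.62,
    "Correctness of Information": 2.40,
    "Detail Orientation": 2.52,
    "Temporal Understanding": 1.98,
}


def average_score(per_aspect):
    """Unweighted arithmetic mean of the scores, rounded to two decimals.

    Assumption: every aspect carries equal weight; the source page
    does not say how its "mean" row is derived.
    """
    return round(mean(per_aspect.values()), 2)


print(average_score(scores))  # 11.89 / 5 = 2.378, which rounds to 2.38
```

The same check does not reproduce the VCGBench-Diverse mean of 2.08 from its eight listed scores, so that row is presumably aggregated differently.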

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)