Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the under-explored field of \emph{video-based conversation} by introducing Video-ChatGPT, a multimodal model that merges a video-adapted visual encoder with an LLM. The resulting model is capable of understanding and generating detailed conversations about videos. To train Video-ChatGPT, we introduce a new dataset of 100,000 video-instruction pairs acquired via a manual and semi-automated annotation pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyze their strengths and weaknesses. Code: https://github.com/mbzuai-oryx/Video-ChatGPT.
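The quantitative evaluation framework mentioned above rates model answers with an LLM judge on criteria such as correctness, detail orientation, and temporal understanding. The sketch below illustrates the general shape of such a protocol; the prompt wording, the `judge` callable, and the 1–5 clamping are illustrative assumptions, not the paper's exact template or API.

```python
# Hedged sketch of an LLM-judge scoring step: a judge model rates a prediction
# against the ground-truth answer on one criterion, on a 1-5 scale.
# `judge` stands in for any chat-completion call (stubbed here for the example).
from typing import Callable


def build_judge_prompt(criterion: str, question: str,
                       answer: str, prediction: str) -> str:
    """Compose the rating request sent to the judge LLM (illustrative wording)."""
    return (
        f"Rate the {criterion} of the predicted answer on a scale of 1 to 5.\n"
        f"Question: {question}\n"
        f"Correct answer: {answer}\n"
        f"Predicted answer: {prediction}\n"
        f"Reply with a single integer."
    )


def score_prediction(judge: Callable[[str], str], criterion: str,
                     question: str, answer: str, prediction: str) -> int:
    """Ask the judge for a rating and parse it, clamping to the valid 1-5 range."""
    reply = judge(build_judge_prompt(criterion, question, answer, prediction))
    score = int(reply.strip())
    return min(max(score, 1), 5)


# Example with a stub judge that always answers "4"
stub = lambda prompt: "4"
print(score_prediction(stub, "Correctness of Information",
                       "What is the man doing?", "Cooking pasta", "He is cooking."))
```

In practice the per-criterion scores are averaged over the evaluation set, which is what the per-metric rows in the table below report.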
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | NExT-QA (Open-ended VideoQA) | Accuracy | 54.6 | Video-ChatGPT |
| Question Answering | NExT-QA (Open-ended VideoQA) | Confidence Score | 3.2 | Video-ChatGPT |
| Question Answering | VNBench | Accuracy | 4.1 | Video-ChatGPT |
| Question Answering | MSVD-QA | Accuracy | 64.9 | Video-ChatGPT-7B |
| Question Answering | MSVD-QA | Confidence Score | 3.3 | Video-ChatGPT-7B |
| Question Answering | TGIF-QA | Accuracy | 51.4 | Video-ChatGPT-7B |
| Question Answering | TGIF-QA | Confidence Score | 3.0 | Video-ChatGPT-7B |
| Question Answering | MSRVTT-QA | Accuracy | 49.3 | Video-ChatGPT-7B |
| Question Answering | MSRVTT-QA | Confidence Score | 2.8 | Video-ChatGPT-7B |
| Question Answering | ActivityNet-QA | Accuracy | 35.2 | Video-ChatGPT |
| Question Answering | ActivityNet-QA | Confidence Score | 2.7 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.37 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 2.62 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.4 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.52 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 1.98 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Mean | 2.38 | Video-ChatGPT |
| Video Question Answering | MVBench | Avg. | 32.7 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Consistency | 2.06 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Contextual Understanding | 2.46 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Correctness of Information | 2.07 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Dense Captioning | 0.89 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Detail Orientation | 2.42 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Reasoning | 3.60 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Spatial Understanding | 2.25 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Temporal Understanding | 1.39 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Mean | 2.08 | Video-ChatGPT |
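The reported mean rows can be reproduced as the arithmetic mean of the five core GPT-scored criteria (correctness, detail orientation, contextual understanding, temporal understanding, consistency); a quick check against the VideoInstruct numbers in the table:

```python
# Verify the reported "Mean" row: average of the five per-criterion GPT scores
# for Video-ChatGPT on VideoInstruct, as listed in the table above.
videoinstruct_scores = {
    "Correctness of Information": 2.40,
    "Detail Orientation": 2.52,
    "Contextual Understanding": 2.62,
    "Temporal Understanding": 1.98,
    "Consistency": 2.37,
}
mean = round(sum(videoinstruct_scores.values()) / len(videoinstruct_scores), 2)
print(mean)  # 2.38, matching the reported mean
```

The same calculation over the five corresponding VCGBench-Diverse criteria (2.07, 2.42, 2.46, 1.39, 2.06) yields the reported 2.08; the dense-captioning, reasoning, and spatial scores are evidently excluded from that average.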