VideoChat: Chat-Centric Video Understanding

Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, LiMin Wang, Yu Qiao

2023-05-10

Tasks: Zero-Shot Video Question Answer, Question Answering, Video Question Answering, Video Understanding, Video-based Generative Performance Benchmarking, Video-based Generative Performance Benchmarking (Contextual Understanding), Video-based Generative Performance Benchmarking (Correctness of Information), Video-based Generative Performance Benchmarking (Consistency), Video-based Generative Performance Benchmarking (Temporal Understanding), Video-based Generative Performance Benchmarking (Detail Orientation)

Abstract

In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, which we call VideoChat. It integrates video foundation models and large language models via a learnable neural interface and excels at spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications; it could serve as a simple prototype for future research on chat-centric video understanding. Code and data are available at https://github.com/OpenGVLab/Ask-Anything

Results

Task	Dataset	Metric	Value	Model
Question Answering	NExT-QA (Open-ended VideoQA)	Accuracy	56.6	VideoChat
Question Answering	NExT-QA (Open-ended VideoQA)	Confidence Score	3.2	VideoChat
Question Answering	VNBench	Accuracy	12.4	VideoChat2
Question Answering	MSVD-QA	Accuracy	56.3	Video Chat-7B
Question Answering	MSVD-QA	Confidence Score	2.8	Video Chat-7B
Question Answering	TGIF-QA	Accuracy	34.4	Video Chat-7B
Question Answering	TGIF-QA	Confidence Score	2.3	Video Chat-7B
Question Answering	MSRVTT-QA	Accuracy	45	Video Chat-7B
Question Answering	MSRVTT-QA	Confidence Score	2.5	Video Chat-7B
Question Answering	ActivityNet-QA	Accuracy	26.5	Video Chat
Question Answering	ActivityNet-QA	Confidence Score	2.2	Video Chat
Visual Question Answering (VQA)	VideoInstruct	Consistency	2.24	Video Chat
Visual Question Answering (VQA)	VideoInstruct	Contextual Understanding	2.53	Video Chat
Visual Question Answering (VQA)	VideoInstruct	Correctness of Information	2.23	Video Chat
Visual Question Answering (VQA)	VideoInstruct	Detail Orientation	2.5	Video Chat
Visual Question Answering (VQA)	VideoInstruct	Temporal Understanding	1.94	Video Chat
Visual Question Answering (VQA)	VideoInstruct	mean	2.29	Video Chat
Visual Question Answering (VQA)	VideoInstruct	gpt-score	2.53	Video Chat
Visual Question Answering (VQA)	VideoInstruct	gpt-score	2.32	Video Chat
Visual Question Answering (VQA)	VideoInstruct	gpt-score	2.5	Video Chat
Visual Question Answering (VQA)	VideoInstruct	gpt-score	1.94	Video Chat
Visual Question Answering (VQA)	VideoInstruct	gpt-score	2.24	Video Chat
Video Question Answering	ActivityNet-QA	Accuracy	26.5	Video Chat
Video Question Answering	ActivityNet-QA	Confidence Score	2.2	Video Chat
Video Question Answering	MVBench	Avg.	35.5	VideoChat
Video Question Answering	VNBench	Accuracy	12.4	VideoChat2
Video Question Answering	MSVD-QA	Accuracy	56.3	Video Chat-7B
Video Question Answering	MSVD-QA	Confidence Score	2.8	Video Chat-7B
Video Question Answering	TGIF-QA	Accuracy	34.4	Video Chat-7B
Video Question Answering	TGIF-QA	Confidence Score	2.3	Video Chat-7B
Video Question Answering	MSRVTT-QA	Accuracy	45	Video Chat-7B
Video Question Answering	MSRVTT-QA	Confidence Score	2.5	Video Chat-7B
Generative Visual Question Answering	VideoInstruct	Consistency	2.24	Video Chat
Generative Visual Question Answering	VideoInstruct	Contextual Understanding	2.53	Video Chat
Generative Visual Question Answering	VideoInstruct	Correctness of Information	2.23	Video Chat
Generative Visual Question Answering	VideoInstruct	Detail Orientation	2.5	Video Chat
Generative Visual Question Answering	VideoInstruct	Temporal Understanding	1.94	Video Chat
Generative Visual Question Answering	VideoInstruct	mean	2.29	Video Chat
Generative Visual Question Answering	VideoInstruct	gpt-score	2.53	Video Chat
Generative Visual Question Answering	VideoInstruct	gpt-score	2.32	Video Chat
Generative Visual Question Answering	VideoInstruct	gpt-score	2.5	Video Chat
Generative Visual Question Answering	VideoInstruct	gpt-score	1.94	Video Chat
Generative Visual Question Answering	VideoInstruct	gpt-score	2.24	Video Chat
Video-based Generative Performance Benchmarking (Correctness of Information)	VideoInstruct	gpt-score	2.32	Video Chat
Video-based Generative Performance Benchmarking	VideoInstruct	Consistency	2.24	Video Chat
Video-based Generative Performance Benchmarking	VideoInstruct	Contextual Understanding	2.53	Video Chat
Video-based Generative Performance Benchmarking	VideoInstruct	Correctness of Information	2.23	Video Chat
Video-based Generative Performance Benchmarking	VideoInstruct	Detail Orientation	2.5	Video Chat
Video-based Generative Performance Benchmarking	VideoInstruct	Temporal Understanding	1.94	Video Chat
Video-based Generative Performance Benchmarking	VideoInstruct	mean	2.29	Video Chat
Video-based Generative Performance Benchmarking	VideoInstruct	gpt-score	2.53	Video Chat
Video-based Generative Performance Benchmarking	VideoInstruct	gpt-score	2.32	Video Chat
Video-based Generative Performance Benchmarking	VideoInstruct	gpt-score	2.5	Video Chat
Video-based Generative Performance Benchmarking	VideoInstruct	gpt-score	1.94	Video Chat
Video-based Generative Performance Benchmarking	VideoInstruct	gpt-score	2.24	Video Chat
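
The `mean` rows for VideoInstruct appear to be the arithmetic mean of the five per-dimension scores listed with them. The page does not state this, so treat it as an assumption. A minimal Python sketch that checks it against the table values follows; the names `VIDEOINSTRUCT_SCORES` and `mean_score` are hypothetical and chosen only for illustration.

```python
# Per-dimension VideoInstruct scores for Video Chat, copied from the table above.
VIDEOINSTRUCT_SCORES = {
    "Consistency": 2.24,
    "Contextual Understanding": 2.53,
    "Correctness of Information": 2.23,
    "Detail Orientation": 2.5,
    "Temporal Understanding": 1.94,
}


def mean_score(scores):
    """Return the arithmetic mean of the values in a metric -> score mapping."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Rounded to two decimals for comparison with the reported `mean` row.
    print(f"{mean_score(VIDEOINSTRUCT_SCORES):.2f}")
```

Rounded to two decimals, the result matches the reported `mean` of 2.29, which is consistent with the assumption but does not confirm how the leaderboard computed it.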
