MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang

Published: 2023-07-31
Venue: CVPR 2024
Tasks: Zero-Shot Video Question Answer, Question Answering, Video Question Answering, Zero-Shot Long Video Question Answering, Video Understanding, Multiple-choice, Video-based Generative Performance Benchmarking, Video-based Generative Performance Benchmarking (Contextual Understanding), Video-based Generative Performance Benchmarking (Correctness of Information), Video-based Generative Performance Benchmarking (Consistency), Video-based Generative Performance Benchmarking (Temporal Understanding), Video-based Generative Performance Benchmarking (Detail Orientation)


Abstract

Integrating video foundation models with large language models to build video understanding systems can overcome the limitations of specific pre-defined vision tasks. Yet existing systems can handle only videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connections pose additional challenges. Drawing on the Atkinson-Shiffrin memory model, we propose MovieChat, which uses Transformer tokens as the carriers of memory in combination with a specially designed memory mechanism, to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos and 14K manual annotations, to validate the effectiveness of our method.
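
The abstract names the idea (tokens as memory carriers, organized after the Atkinson-Shiffrin short-term/long-term model) but does not spell out the mechanism. The following is a minimal sketch of one way such a memory could work, not the authors' implementation. Everything in it is an assumption made for illustration: the names `SparseMemory`, `consolidate`, `short_capacity`, and `merged_len`, the use of a single feature vector per frame, and the rule of averaging the most similar adjacent frames when the short-term buffer fills.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consolidate(frames, target_len):
    """Repeatedly average the most similar adjacent pair of frames
    until only target_len frames remain (assumed merging rule)."""
    frames = [list(f) for f in frames]
    while len(frames) > target_len:
        sims = [cosine(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
    return frames


class SparseMemory:
    """Hypothetical two-stage memory: a dense short-term buffer that is
    compressed into a sparse long-term store whenever it fills up."""

    def __init__(self, short_capacity=4, merged_len=2):
        self.short_capacity = short_capacity  # frames held before compression
        self.merged_len = merged_len          # frames kept after compression
        self.short_term = []
        self.long_term = []

    def add_frame(self, token):
        self.short_term.append(list(token))
        if len(self.short_term) >= self.short_capacity:
            self.long_term.extend(consolidate(self.short_term, self.merged_len))
            self.short_term = []


memory = SparseMemory(short_capacity=4, merged_len=2)
for frame in [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]] * 2:
    memory.add_frame(frame)
print(len(memory.long_term), len(memory.short_term))
```

In this toy run, eight dense frame vectors are reduced to four long-term entries, so the cost of attending over memory grows with the number of consolidated entries rather than with the raw frame count.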

Results

Task	Dataset	Metric	Value	Model
Question Answering	NExT-QA (Open-ended VideoQA)	Accuracy	49.9	MovieChat
Question Answering	NExT-QA (Open-ended VideoQA)	Confidence Score	2.7	MovieChat
Question Answering	MSVD-QA	Accuracy	75.2	MovieChat
Question Answering	MSVD-QA	Confidence Score	2.9	MovieChat
Question Answering	MSRVTT-QA	Accuracy	52.7	MovieChat
Question Answering	MSRVTT-QA	Confidence Score	2.6	MovieChat
Question Answering	ActivityNet-QA	Accuracy	45.7	MovieChat
Question Answering	ActivityNet-QA	Confidence Score	3.1	MovieChat
Visual Question Answering (VQA)	VideoInstruct	gpt-score	3.01	MovieChat
Visual Question Answering (VQA)	VideoInstruct	gpt-score	2.76	MovieChat
Visual Question Answering (VQA)	VideoInstruct	gpt-score	2.93	MovieChat
Visual Question Answering (VQA)	VideoInstruct	gpt-score	2.24	MovieChat
Visual Question Answering (VQA)	VideoInstruct	gpt-score	2.42	MovieChat
Video Question Answering	OVBench	AVG	30.9	MovieChat (7B)
Video Question Answering	ActivityNet-QA	Accuracy	45.7	MovieChat
Video Question Answering	ActivityNet-QA	Confidence Score	3.1	MovieChat
Video Question Answering	MSVD-QA	Accuracy	75.2	MovieChat
Video Question Answering	MSVD-QA	Confidence Score	2.9	MovieChat
Video Question Answering	MSRVTT-QA	Accuracy	52.7	MovieChat
Video Question Answering	MSRVTT-QA	Confidence Score	2.6	MovieChat
Generative Visual Question Answering	VideoInstruct	gpt-score	3.01	MovieChat
Generative Visual Question Answering	VideoInstruct	gpt-score	2.76	MovieChat
Generative Visual Question Answering	VideoInstruct	gpt-score	2.93	MovieChat
Generative Visual Question Answering	VideoInstruct	gpt-score	2.24	MovieChat
Generative Visual Question Answering	VideoInstruct	gpt-score	2.42	MovieChat
Video-based Generative Performance Benchmarking (Correctness of Information)	VideoInstruct	gpt-score	2.76	MovieChat
Video-based Generative Performance Benchmarking	VideoInstruct	gpt-score	3.01	MovieChat
Video-based Generative Performance Benchmarking	VideoInstruct	gpt-score	2.76	MovieChat
Video-based Generative Performance Benchmarking	VideoInstruct	gpt-score	2.93	MovieChat
Video-based Generative Performance Benchmarking	VideoInstruct	gpt-score	2.24	MovieChat
Video-based Generative Performance Benchmarking	VideoInstruct	gpt-score	2.42	MovieChat
