MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim

2024-04-08 | CVPR 2024

Tasks: Question Answering, Video Question Answering, Video Captioning, Video Classification, Video Understanding, Visual Question Answering (VQA), Temporal Relation Extraction, Multiple-choice

Abstract

With the success of large language models (LLMs), integrating vision models into LLMs to build vision-language foundation models has attracted growing interest. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames, which restricts them to short video understanding. In this study, we focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously, as most existing work does, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model achieves state-of-the-art performance across multiple datasets. Code is available at https://boheumd.github.io/MA-LMM/.
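
To make the idea of online processing with a bounded memory bank concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the class name `MemoryBank`, the `capacity` parameter, the per-frame feature vectors, and the compression rule (averaging the most similar pair of adjacent entries once the bank is full) are all hypothetical choices that the abstract does not specify. The point it shows is that frames arrive one at a time while the stored history never grows past a fixed size.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class MemoryBank:
    """Hypothetical fixed-capacity store of per-frame features, kept in temporal order."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.slots: List[Vector] = []

    def add(self, feature: Vector) -> None:
        """Append the newest frame feature, compressing if the bank overflows."""
        self.slots.append(list(feature))
        if len(self.slots) > self.capacity:
            self._compress()

    def _compress(self) -> None:
        # Assumed rule: merge the adjacent pair with the highest cosine
        # similarity by averaging, which shrinks the bank by one entry
        # while preserving temporal order.
        best = max(
            range(len(self.slots) - 1),
            key=lambda i: cosine(self.slots[i], self.slots[i + 1]),
        )
        merged = [(x + y) / 2.0 for x, y in zip(self.slots[best], self.slots[best + 1])]
        self.slots[best:best + 2] = [merged]


# Usage: stream five toy "frame features" through a bank that holds three.
bank = MemoryBank(capacity=3)
for frame in [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]:
    bank.add(frame)
```

In this toy run the two pairs of identical frames are each merged into one entry, so the bank ends with three slots even though five frames were seen. A real model would store learned visual features and let the language model attend to the bank contents; that part is outside the scope of this sketch.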

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	MSVD-QA	Accuracy	0.606	MA-LMM
Video Question Answering	ActivityNet-QA	Accuracy	49.8	MA-LMM
Video Question Answering	MSRVTT-QA	Accuracy	48.5	MA-LMM
Video Captioning	YouCook2	CIDEr	1.31	MA-LMM
Video Captioning	YouCook2	METEOR	17.6	MA-LMM
Temporal Relation Extraction	Vinoground	Group Score	6.8	MA-LMM-Vicuna-7B
Temporal Relation Extraction	Vinoground	Text Score	23.8	MA-LMM-Vicuna-7B
Temporal Relation Extraction	Vinoground	Video Score	25.6	MA-LMM-Vicuna-7B
Video Classification	Breakfast	Accuracy (%)	93	MA-LMM
Video Classification	COIN	Accuracy (%)	93.2	MA-LMM
