Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim

Date: 2024-04-08
Venue: CVPR 2024
Tasks: Question Answering, Video Question Answering, Video Captioning, Video Classification, Video Understanding, Visual Question Answering (VQA), Temporal Relation Extraction, Multiple-choice

Abstract

With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model can achieve state-of-the-art performances across multiple datasets. Code available at https://boheumd.github.io/MA-LMM/.
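
The online processing idea in the abstract can be illustrated with a small sketch: frames arrive one at a time, each frame's feature vector is appended to a bounded memory bank, and the bank is compressed whenever it exceeds its budget. The abstract does not say how the bank is kept within budget, so the rule used here (average the two most similar adjacent entries) is an assumption for illustration, and `MemoryBank`, `capacity`, and `cosine` are hypothetical names rather than the authors' API.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryBank:
    """Hypothetical fixed-capacity store of per-frame feature vectors."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries: List[Vector] = []

    def add(self, feature: Vector) -> None:
        """Append one frame's feature, then compress if over capacity."""
        self.entries.append(list(feature))
        if len(self.entries) > self.capacity:
            self._merge_most_similar_neighbours()

    def _merge_most_similar_neighbours(self) -> None:
        # Assumed compression rule: find the adjacent pair with the highest
        # cosine similarity and replace it with the element-wise average.
        sims = [
            cosine(self.entries[i], self.entries[i + 1])
            for i in range(len(self.entries) - 1)
        ]
        i = sims.index(max(sims))
        merged = [(x + y) / 2 for x, y in zip(self.entries[i], self.entries[i + 1])]
        self.entries[i:i + 2] = [merged]


# Toy usage: five 2-D "frame features" streamed into a bank of size 3.
bank = MemoryBank(capacity=3)
for frame_feature in [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.3, 0.9], [1.0, 1.0]]:
    bank.add(frame_feature)
print(len(bank.entries))  # 3: the bank never grows past its capacity
```

Because the bank's size is fixed, the cost of attending to it stays constant however long the video runs, which is the property the abstract relies on to stay within context-length and GPU memory limits.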

Results

Task                             Dataset         Metric        Value  Model
Relation Extraction              Vinoground      Group Score   6.8    MA-LMM-Vicuna-7B
Relation Extraction              Vinoground      Text Score    23.8   MA-LMM-Vicuna-7B
Relation Extraction              Vinoground      Video Score   25.6   MA-LMM-Vicuna-7B
Video                            Breakfast       Accuracy (%)  93     MA-LMM
Video                            COIN            Accuracy (%)  93.2   MA-LMM
Visual Question Answering (VQA)  MSVD-QA         Accuracy      0.606  MA-LMM
Video Question Answering         ActivityNet-QA  Accuracy      49.8   MA-LMM
Video Question Answering         MSRVTT-QA       Accuracy      48.5   MA-LMM
Video Captioning                 YouCook2        CIDEr         1.31   MA-LMM
Video Captioning                 YouCook2        METEOR        17.6   MA-LMM
Temporal Relation Extraction     Vinoground      Group Score   6.8    MA-LMM-Vicuna-7B
Temporal Relation Extraction     Vinoground      Text Score    23.8   MA-LMM-Vicuna-7B
Temporal Relation Extraction     Vinoground      Video Score   25.6   MA-LMM-Vicuna-7B
Video Classification             Breakfast       Accuracy (%)  93     MA-LMM
Video Classification             COIN            Accuracy (%)  93.2   MA-LMM

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models (2025-07-17)