Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang

Date: 2023-07-31
Venue: CVPR 2024 1
Tasks: Zero-Shot Video Question Answer, Question Answering, Video-based Generative Performance Benchmarking, Video-based Generative Performance Benchmarking (Contextual Understanding), zero-shot long video question answering, Video-based Generative Performance Benchmarking (Correctness of Information), Video Question Answering, Video-based Generative Performance Benchmarking (Consistency), Video-based Generative Performance Benchmarking (Temporal Understanding), Video-based Generative Performance Benchmarking (Detail Orientation), Video Understanding, Multiple-choice

Abstract

Integrating video foundation models with large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connection pose additional challenges. Drawing on the Atkinson-Shiffrin memory model, we propose MovieChat, which uses tokens in Transformers as the carriers of memory in combination with a specially designed memory mechanism, to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos and 14K manual annotations, to validate the effectiveness of our method.
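
The abstract does not spell out the memory mechanism, so the following is only a minimal sketch of one way a dense stream of frame tokens could be reduced to a sparse memory: a short-term buffer that, once full, is compressed by repeatedly averaging the most similar pair of adjacent frame features and then appended to a long-term store. The names `cosine`, `consolidate`, `SparseMemory`, `short_capacity`, and `merged_len` are hypothetical, each frame is assumed to be a single feature vector, and this is not the authors' implementation.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def consolidate(frames: List[Vector], target_len: int) -> List[Vector]:
    """Greedily merge the most similar adjacent pair until target_len frames remain."""
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    merged = [list(f) for f in frames]  # copy so the caller's list is untouched
    while len(merged) > target_len:
        sims = [cosine(merged[i], merged[i + 1]) for i in range(len(merged) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)  # first most-similar pair
        average = [(x + y) / 2 for x, y in zip(merged[i], merged[i + 1])]
        merged[i:i + 2] = [average]  # replace the pair with its average
    return merged


class SparseMemory:
    """Hypothetical two-level memory: a dense short-term buffer and a sparse long-term store."""

    def __init__(self, short_capacity: int, merged_len: int) -> None:
        self.short_capacity = short_capacity  # assumed buffer size, in frames
        self.merged_len = merged_len          # assumed frames kept per consolidation
        self.short_term: List[Vector] = []
        self.long_term: List[Vector] = []

    def add_frame(self, feature: Vector) -> None:
        """Buffer one frame; when the buffer fills, compress it into long-term memory."""
        self.short_term.append(list(feature))
        if len(self.short_term) >= self.short_capacity:
            self.long_term.extend(consolidate(self.short_term, self.merged_len))
            self.short_term = []


if __name__ == "__main__":
    memory = SparseMemory(short_capacity=4, merged_len=2)
    for frame in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]):
        memory.add_frame(frame)
    print(memory.long_term)
```

With these assumptions, long-term memory grows by `merged_len` entries per `short_capacity` input frames, so its size scales with video length at a reduced rate rather than one entry per frame.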

Results

Task | Dataset | Metric | Value | Model
Question Answering | NExT-QA (Open-ended VideoQA) | Accuracy | 49.9 | MovieChat
Question Answering | NExT-QA (Open-ended VideoQA) | Confidence Score | 2.7 | MovieChat
Question Answering | MSVD-QA | Accuracy | 75.2 | MovieChat
Question Answering | MSVD-QA | Confidence Score | 2.9 | MovieChat
Question Answering | MSRVTT-QA | Accuracy | 52.7 | MovieChat
Question Answering | MSRVTT-QA | Confidence Score | 2.6 | MovieChat
Question Answering | ActivityNet-QA | Accuracy | 45.7 | MovieChat
Question Answering | ActivityNet-QA | Confidence Score | 3.1 | MovieChat
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 3.01 | MovieChat
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.76 | MovieChat
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.93 | MovieChat
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.24 | MovieChat
Visual Question Answering (VQA) | VideoInstruct | gpt-score | 2.42 | MovieChat
Video Question Answering | OVBench | AVG | 30.9 | MovieChat (7B)
Video Question Answering | ActivityNet-QA | Accuracy | 45.7 | MovieChat
Video Question Answering | ActivityNet-QA | Confidence Score | 3.1 | MovieChat
Video Question Answering | MSVD-QA | Accuracy | 75.2 | MovieChat
Video Question Answering | MSVD-QA | Confidence Score | 2.9 | MovieChat
Video Question Answering | MSRVTT-QA | Accuracy | 52.7 | MovieChat
Video Question Answering | MSRVTT-QA | Confidence Score | 2.6 | MovieChat
Generative Visual Question Answering | VideoInstruct | gpt-score | 3.01 | MovieChat
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.76 | MovieChat
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.93 | MovieChat
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.24 | MovieChat
Generative Visual Question Answering | VideoInstruct | gpt-score | 2.42 | MovieChat
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 2.76 | MovieChat
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 3.01 | MovieChat
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.76 | MovieChat
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.93 | MovieChat
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.24 | MovieChat
Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score | 2.42 | MovieChat
