LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra

2024-10-22
Zero-Shot Video Question Answer, Video Question Answering, Video Understanding


Abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge because it is constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is to leverage cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we use DINOv2 features to remove redundant frames that exhibit high similarity. We then use a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, LongVU also scales effectively to a smaller size with state-of-the-art video understanding performance.
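To make the first compression step more concrete, the following is a minimal sketch of similarity-based frame pruning. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says that frames with highly similar DINOv2 features are removed, so the greedy comparison against the last kept frame, the use of cosine similarity, the `threshold` value, and the function names `cosine_similarity` and `prune_redundant_frames` are all hypothetical. Plain Python lists stand in for real DINOv2 feature vectors.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def prune_redundant_frames(
    features: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> List[int]:
    """Return indices of frames to keep.

    A frame is dropped when its feature vector is at least `threshold`
    similar to the most recently kept frame (a greedy, assumed policy).
    """
    kept: List[int] = []
    for index, feature in enumerate(features):
        if not kept:
            kept.append(index)  # always keep the first frame
            continue
        similarity = cosine_similarity(features[kept[-1]], feature)
        if similarity < threshold:
            kept.append(index)
    return kept


if __name__ == "__main__":
    # Toy per-frame features; real inputs would come from a vision encoder.
    toy_features = [
        [1.0, 0.0],
        [0.99, 0.01],  # near-duplicate of frame 0
        [0.0, 1.0],
        [0.0, 1.0],    # exact duplicate of frame 2
        [1.0, 0.0],
    ]
    print(prune_redundant_frames(toy_features))
```

In this toy input, frames 1 and 3 are near or exact duplicates of the frame kept before them, so only frames 0, 2, and 4 survive. The later steps described in the abstract (text-guided frame feature reduction and spatial token reduction across frames) are not covered by this sketch.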

Results

Task	Dataset	Metric	Value	Model
Question Answering	Video-MME	Accuracy (%)	60.6	LongVU (7B)
Question Answering	EgoSchema (fullset)	Accuracy (%)	67.6	LongVU (7B)
Video Question Answering	MVBench	Avg.	66.9	LongVU (7B)
Video Question Answering	Video-MME	Accuracy (%)	60.6	LongVU (7B)
Video Question Answering	EgoSchema (fullset)	Accuracy (%)	67.6	LongVU (7B)
