VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

2024-06-11

Tasks: Zero-Shot Video Question Answer, Question Answering, Video Question Answering, Video Captioning, Visual Question Answering (VQA), Temporal Relation Extraction, Multiple-choice

Abstract

In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video- and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even approaches some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements over existing models on audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are publicly available to facilitate further research.

Results

Task	Dataset	Metric	Value	Model
Relation Extraction	Vinoground	Group Score	8.4	VideoLLaMA2-72B
Relation Extraction	Vinoground	Text Score	36.2	VideoLLaMA2-72B
Relation Extraction	Vinoground	Video Score	21.8	VideoLLaMA2-72B
Question Answering	Video-MME (w/o subs)	Accuracy (%)	60.9	VideoLLaMA2 (72B)
Question Answering	Video-MME	Accuracy (%)	63.1	VideoLLaMA2 (72B)
Question Answering	VNBench	Accuracy	4.5	VideoLLaMA2
Question Answering	EgoSchema (fullset)	Accuracy	63.9	VideoLLaMA2 (72B)
Video Question Answering	TVBench	Average Accuracy	48.4	VideoLLaMA2 72B
Video Question Answering	TVBench	Average Accuracy	42.9	VideoLLaMA2 7B
Video Question Answering	TVBench	Average Accuracy	42.1	VideoLLaMA2.1
Video Question Answering	NExT-QA	Accuracy	75.6	VideoLLaMA2.1(7B)
Video Question Answering	Perception Test	Accuracy (Top-1)	57.5	VideoLLaMA2 (72B)
Video Question Answering	MVBench	Avg.	62	VideoLLaMA2 (72B)
Video Question Answering	Video-MME (w/o subs)	Accuracy (%)	60.9	VideoLLaMA2 (72B)
Video Question Answering	Video-MME	Accuracy (%)	63.1	VideoLLaMA2 (72B)
Video Question Answering	VNBench	Accuracy	4.5	VideoLLaMA2
Video Question Answering	EgoSchema (fullset)	Accuracy	63.9	VideoLLaMA2 (72B)
Temporal Relation Extraction	Vinoground	Group Score	8.4	VideoLLaMA2-72B
Temporal Relation Extraction	Vinoground	Text Score	36.2	VideoLLaMA2-72B
Temporal Relation Extraction	Vinoground	Video Score	21.8	VideoLLaMA2-72B
