Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video- and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the model's multimodal understanding by incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even approaches some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 shows reasonable improvements over existing models on audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are publicly available to facilitate further research.
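To make the idea of joint spatial-temporal aggregation concrete, the following is a minimal sketch and not the STC connector itself, whose architecture the abstract does not specify. It assumes a toy setting in which each video patch carries a single scalar feature, and it uses non-overlapping 3D average pooling (a 3D convolution with a fixed uniform kernel) as a stand-in for a learned convolution. The names `pool3d` and `token_count` and the 2x2x2 kernel size are hypothetical choices for illustration.

```python
from itertools import product


def pool3d(grid, kt=2, kh=2, kw=2):
    """Non-overlapping 3D average pooling over a [T][H][W] nested list of floats.

    This equals a 3D convolution with a uniform kernel and stride == kernel size.
    It is an illustrative stand-in, not the learned STC connector from the paper.
    """
    T, H, W = len(grid), len(grid[0]), len(grid[0][0])
    if T % kt or H % kh or W % kw:
        raise ValueError("grid dimensions must be divisible by the kernel size")
    out = []
    for t in range(0, T, kt):
        plane = []
        for h in range(0, H, kh):
            row = []
            for w in range(0, W, kw):
                # Gather one kt x kh x kw window spanning both time and space.
                window = [
                    grid[t + dt][h + dh][w + dw]
                    for dt, dh, dw in product(range(kt), range(kh), range(kw))
                ]
                row.append(sum(window) / len(window))
            plane.append(row)
        out.append(plane)
    return out


def token_count(grid):
    """Number of tokens (cells) in a [T][H][W] grid."""
    return len(grid) * len(grid[0]) * len(grid[0][0])


# Toy input (assumed sizes): 4 frames, each a 4x4 grid of scalar patch features.
frames = [
    [[float(t * 16 + h * 4 + w) for w in range(4)] for h in range(4)]
    for t in range(4)
]
pooled = pool3d(frames)

print(token_count(frames), "->", token_count(pooled))
```

Under these assumptions, each output token summarizes a small block of neighboring patches across adjacent frames, so the number of visual tokens passed to the language model shrinks by the product of the kernel sizes.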
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Relation Extraction | Vinoground | Group Score | 8.4 | VideoLLaMA2-72B |
| Relation Extraction | Vinoground | Text Score | 36.2 | VideoLLaMA2-72B |
| Relation Extraction | Vinoground | Video Score | 21.8 | VideoLLaMA2-72B |
| Question Answering | Video-MME (w/o subs) | Accuracy (%) | 60.9 | VideoLLaMA2 (72B) |
| Question Answering | Video-MME | Accuracy (%) | 63.1 | VideoLLaMA2 (72B) |
| Question Answering | VNBench | Accuracy | 4.5 | VideoLLaMA2 |
| Question Answering | EgoSchema (fullset) | Accuracy | 63.9 | VideoLLaMA2 (72B) |
| Video Question Answering | TVBench | Average Accuracy | 48.4 | VideoLLaMA2 72B |
| Video Question Answering | TVBench | Average Accuracy | 42.9 | VideoLLaMA2 7B |
| Video Question Answering | TVBench | Average Accuracy | 42.1 | VideoLLaMA2.1 |
| Video Question Answering | NExT-QA | Accuracy | 75.6 | VideoLLaMA2.1(7B) |
| Video Question Answering | Perception Test | Accuracy (Top-1) | 57.5 | VideoLLaMA2 (72B) |
| Video Question Answering | MVBench | Avg. | 62 | VideoLLaMA2 (72B) |
| Video Question Answering | Video-MME (w/o subs) | Accuracy (%) | 60.9 | VideoLLaMA2 (72B) |
| Video Question Answering | Video-MME | Accuracy (%) | 63.1 | VideoLLaMA2 (72B) |
| Video Question Answering | VNBench | Accuracy | 4.5 | VideoLLaMA2 |
| Video Question Answering | EgoSchema (fullset) | Accuracy | 63.9 | VideoLLaMA2 (72B) |
| Temporal Relation Extraction | Vinoground | Group Score | 8.4 | VideoLLaMA2-72B |
| Temporal Relation Extraction | Vinoground | Text Score | 36.2 | VideoLLaMA2-72B |
| Temporal Relation Extraction | Vinoground | Video Score | 21.8 | VideoLLaMA2-72B |