Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

Published: 2024-06-11

Tasks: Zero-Shot Video Question Answering, Question Answering, Video Question Answering, Video Captioning, Visual Question Answering (VQA), Temporal Relation Extraction, Multiple-choice

Links: Paper · PDF · Code (official)

Abstract

In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the model's multimodal understanding by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even approaches some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements over existing models on audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks. These advancements underline VideoLLaMA 2's strong multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.
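The abstract describes an STC connector that compresses spatial-temporal video features before they reach the language model. The paper does not detail its layout here, so the following is only a minimal NumPy sketch of the general idea: downsample frame features over time and space (average pooling standing in for the strided 3D convolution), then project the result into a token sequence. All names, shapes, and strides below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def stc_connector(feats, t_stride=2, s_stride=2, proj=None):
    """Sketch of a spatial-temporal connector (hypothetical, not the paper's code).

    feats: per-frame vision features of shape (T, H, W, C).
    Pools over time and space with the given strides, then applies a
    linear projection into the LLM embedding space, returning a
    flattened sequence of video tokens.
    """
    T, H, W, C = feats.shape
    # Crop so each dimension divides evenly by its stride.
    T2, H2, W2 = T // t_stride, H // s_stride, W // s_stride
    x = feats[:T2 * t_stride, :H2 * s_stride, :W2 * s_stride]
    # Block-average over (t, h, w) windows -- a stand-in for strided Conv3d.
    x = x.reshape(T2, t_stride, H2, s_stride, W2, s_stride, C).mean(axis=(1, 3, 5))
    if proj is None:
        proj = np.eye(C)  # identity placeholder for the learned projection
    # Flatten the downsampled grid into one token sequence for the LLM.
    tokens = x.reshape(T2 * H2 * W2, C) @ proj
    return tokens

feats = np.random.rand(8, 16, 16, 64)   # 8 frames of 16x16 patch features
tokens = stc_connector(feats)
print(tokens.shape)  # → (256, 64): 4 time steps × 8×8 spatial grid
```

The design point the abstract emphasizes is that compressing jointly over time and space (rather than per-frame) lets the connector carry motion information into the token sequence while keeping its length manageable.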

Results

| Task                         | Dataset             | Metric           | Value | Model               |
|------------------------------|---------------------|------------------|-------|---------------------|
| Question Answering           | Video-MME (w/o subs)| Accuracy (%)     | 60.9  | VideoLLaMA2 (72B)   |
| Question Answering           | Video-MME           | Accuracy (%)     | 63.1  | VideoLLaMA2 (72B)   |
| Question Answering           | VNBench             | Accuracy         | 4.5   | VideoLLaMA2         |
| Question Answering           | EgoSchema (fullset) | Accuracy         | 63.9  | VideoLLaMA2 (72B)   |
| Video Question Answering     | TVBench             | Average Accuracy | 48.4  | VideoLLaMA2 72B     |
| Video Question Answering     | TVBench             | Average Accuracy | 42.9  | VideoLLaMA2 7B      |
| Video Question Answering     | TVBench             | Average Accuracy | 42.1  | VideoLLaMA2.1       |
| Video Question Answering     | NExT-QA             | Accuracy         | 75.6  | VideoLLaMA2.1 (7B)  |
| Video Question Answering     | Perception Test     | Accuracy (Top-1) | 57.5  | VideoLLaMA2 (72B)   |
| Video Question Answering     | MVBench             | Avg.             | 62    | VideoLLaMA2 (72B)   |
| Video Question Answering     | Video-MME (w/o subs)| Accuracy (%)     | 60.9  | VideoLLaMA2 (72B)   |
| Video Question Answering     | Video-MME           | Accuracy (%)     | 63.1  | VideoLLaMA2 (72B)   |
| Video Question Answering     | VNBench             | Accuracy         | 4.5   | VideoLLaMA2         |
| Video Question Answering     | EgoSchema (fullset) | Accuracy         | 63.9  | VideoLLaMA2 (72B)   |
| Temporal Relation Extraction | Vinoground          | Group Score      | 8.4   | VideoLLaMA2-72B     |
| Temporal Relation Extraction | Vinoground          | Text Score       | 36.2  | VideoLLaMA2-72B     |
| Temporal Relation Extraction | Vinoground          | Video Score      | 21.8  | VideoLLaMA2-72B     |

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)