Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, Lidong Bing

Published: 2023-06-05

Tasks: Zero-Shot Video Question Answer, Text Generation, Video-based Generative Performance Benchmarking, Video-based Generative Performance Benchmarking (Contextual Understanding), Video-Text Retrieval, Video-based Generative Performance Benchmarking (Correctness of Information), Video Question Answering, Video-based Generative Performance Benchmarking (Consistency), Video-based Generative Performance Benchmarking (Temporal Understanding), Video-based Generative Performance Benchmarking (Detail Orientation), Video Understanding, Language Modelling

Abstract

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in video. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. Unlike previous works that equip LLMs to process only visual or only audio signals, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, and (2) integrating audio-visual signals. To address the first challenge, we propose a Video Q-Former that assembles a pre-trained image encoder into our video encoder, and we introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind, a universal embedding model that aligns multiple modalities, as the pre-trained audio encoder, and we introduce an Audio Q-Former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the outputs of the visual and audio encoders with the LLM's embedding space, we first train Video-LLaMA on massive video/image-caption pairs and then tune the model on a moderate amount of higher-quality visual-instruction data. We find that Video-LLaMA can perceive and comprehend video content and generate meaningful responses grounded in the visual and auditory information presented in videos.
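The core idea in the abstract (a small set of learnable queries that cross-attends to temporally position-embedded frame features from a frozen image encoder, then gets projected into the LLM's embedding space) can be sketched in a few lines of PyTorch. This is a minimal, hypothetical illustration of the Video Q-Former concept only: all dimensions, layer counts, and the class name `VideoQFormer` are illustrative assumptions, not the paper's actual architecture or configuration.

```python
import torch
import torch.nn as nn

class VideoQFormer(nn.Module):
    """Illustrative sketch (not the paper's implementation) of a Video
    Q-Former: learnable queries cross-attend to frame features that carry
    temporal position embeddings, producing a fixed number of video tokens
    projected into the LLM's embedding space."""

    def __init__(self, feat_dim=768, num_queries=32, max_frames=8, llm_dim=4096):
        super().__init__()
        # One learnable position embedding per frame index (assumed design).
        self.temporal_pos = nn.Parameter(torch.zeros(max_frames, feat_dim))
        # Fixed-size set of learnable query embeddings.
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=2)
        # Linear projection into the (frozen) LLM's embedding space.
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, frames, patches, feat_dim) from a frozen image encoder.
        b, t, p, d = frame_feats.shape
        x = frame_feats + self.temporal_pos[:t].view(1, t, 1, d)  # add temporal positions
        x = x.reshape(b, t * p, d)                                # flatten to one token sequence
        q = self.queries.unsqueeze(0).expand(b, -1, -1)           # broadcast queries over batch
        out = self.qformer(q, x)                                  # queries cross-attend to video tokens
        return self.proj(out)                                     # (batch, num_queries, llm_dim)

frame_feats = torch.randn(2, 8, 16, 768)       # e.g. 2 clips, 8 frames, 16 patches each
video_tokens = VideoQFormer()(frame_feats)
print(video_tokens.shape)                      # torch.Size([2, 32, 4096])
```

The same pattern, with an audio encoder such as ImageBind supplying the input features, would correspond to the Audio Q-Former branch described above.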

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Retrieval | Test-of-Time | 2-Class Accuracy | 88.33 | Video-LLaMA |
| Video Question Answering | MVBench | Avg. | 34.1 | Video-LLaMA |
| Video Question Answering | MSVD-QA | Accuracy | 51.6 | Video-LLaMA-7B |
| Video Question Answering | MSVD-QA | Confidence Score | 2.5 | Video-LLaMA-7B |
| Video Question Answering | MSRVTT-QA | Accuracy | 29.6 | Video-LLaMA-7B |
| Video Question Answering | MSRVTT-QA | Confidence Score | 1.8 | Video-LLaMA-7B |
| Video Question Answering | ActivityNet-QA | Accuracy | 12.4 | Video-LLaMA |
| Video Question Answering | ActivityNet-QA | Confidence Score | 1.1 | Video-LLaMA |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 1.96 | Video-LLaMA |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.18 | Video-LLaMA |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 2.16 | Video-LLaMA |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 1.82 | Video-LLaMA |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 1.79 | Video-LLaMA |
| Video-based Generative Performance Benchmarking | VideoInstruct | Mean | 1.98 | Video-LLaMA |
