Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, Lidong Bing

Published: 2023-06-05

Tasks: Zero-Shot Video Question Answer, Text Generation, Video-based Generative Performance Benchmarking, Video-based Generative Performance Benchmarking (Contextual Understanding), Video-Text Retrieval, Video-based Generative Performance Benchmarking (Correctness of Information), Video Question Answering, Video-based Generative Performance Benchmarking (Consistency), Video-based Generative Performance Benchmarking (Temporal Understanding), Video-based Generative Performance Benchmarking (Detail Orientation), Video Understanding, Language Modelling

Abstract

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in video. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. Unlike previous works that equip LLMs to process only visual or only audio signals, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, and (2) integrating audio-visual signals. To address the first challenge, we propose a Video Q-Former that assembles a pre-trained image encoder into our video encoder, and we introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind, a universal embedding model that aligns multiple modalities, as the pre-trained audio encoder, and we introduce an Audio Q-Former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the outputs of the visual and audio encoders with the LLM's embedding space, we first train Video-LLaMA on massive video/image-caption pairs and then tune the model on a moderate amount of higher-quality visual-instruction data. We find that Video-LLaMA can perceive and comprehend video content and generate meaningful responses grounded in the visual and auditory information presented in videos.
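The core idea in the abstract (a small set of learnable queries that cross-attends to temporally position-embedded frame features from a frozen image encoder, then gets projected into the LLM's embedding space) can be sketched in a few lines of PyTorch. This is a minimal, hypothetical illustration of the Video Q-Former concept only: all dimensions, layer counts, and the class name `VideoQFormer` are illustrative assumptions, not the paper's actual architecture or configuration.

```python
import torch
import torch.nn as nn

class VideoQFormer(nn.Module):
    """Illustrative sketch (not the paper's implementation) of a Video
    Q-Former: learnable queries cross-attend to frame features that carry
    temporal position embeddings, producing a fixed number of video tokens
    projected into the LLM's embedding space."""

    def __init__(self, feat_dim=768, num_queries=32, max_frames=8, llm_dim=4096):
        super().__init__()
        # One learnable position embedding per frame index (assumed design).
        self.temporal_pos = nn.Parameter(torch.zeros(max_frames, feat_dim))
        # Fixed-size set of learnable query embeddings.
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=2)
        # Linear projection into the (frozen) LLM's embedding space.
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, frames, patches, feat_dim) from a frozen image encoder.
        b, t, p, d = frame_feats.shape
        x = frame_feats + self.temporal_pos[:t].view(1, t, 1, d)  # add temporal positions
        x = x.reshape(b, t * p, d)                                # flatten to one token sequence
        q = self.queries.unsqueeze(0).expand(b, -1, -1)           # broadcast queries over batch
        out = self.qformer(q, x)                                  # queries cross-attend to video tokens
        return self.proj(out)                                     # (batch, num_queries, llm_dim)

frame_feats = torch.randn(2, 8, 16, 768)       # e.g. 2 clips, 8 frames, 16 patches each
video_tokens = VideoQFormer()(frame_feats)
print(video_tokens.shape)                      # torch.Size([2, 32, 4096])
```

The same pattern, with an audio encoder such as ImageBind supplying the input features, would correspond to the Audio Q-Former branch described above.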

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Retrieval | Test-of-Time | 2-Class Accuracy | 88.33 | Video-LLaMA |
| Video Question Answering | MVBench | Avg. | 34.1 | Video-LLaMA |
| Video Question Answering | MSVD-QA | Accuracy | 51.6 | Video-LLaMA-7B |
| Video Question Answering | MSVD-QA | Confidence Score | 2.5 | Video-LLaMA-7B |
| Video Question Answering | MSRVTT-QA | Accuracy | 29.6 | Video-LLaMA-7B |
| Video Question Answering | MSRVTT-QA | Confidence Score | 1.8 | Video-LLaMA-7B |
| Video Question Answering | ActivityNet-QA | Accuracy | 12.4 | Video-LLaMA |
| Video Question Answering | ActivityNet-QA | Confidence Score | 1.1 | Video-LLaMA |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 1.96 | Video-LLaMA |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.18 | Video-LLaMA |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 2.16 | Video-LLaMA |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 1.82 | Video-LLaMA |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 1.79 | Video-LLaMA |
| Video-based Generative Performance Benchmarking | VideoInstruct | Mean | 1.98 | Video-LLaMA |
