Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan

2023-11-16

Tasks: Zero-Shot Video Question Answer, Question Answering, Video Question Answering, Large Language Model, Visual Question Answering (VQA), Temporal Relation Extraction, Language Modelling, Multiple-choice, Visual Question Answering
Paper · PDF · Code (official)

Abstract

The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM. Code address: https://github.com/PKU-YuanGroup/Video-LLaVA
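The core idea of "alignment before projection" is that images and videos are encoded into one already-aligned visual feature space, so a single shared projection layer can map both modalities into the LLM's embedding space. The following NumPy sketch illustrates only that structural point; all dimensions, token counts, and the random linear projection are illustrative assumptions, not the paper's actual architecture or weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration (not the paper's real dimensions).
D_VISION = 1024   # width of the shared, pre-aligned visual encoder output
D_MODEL = 4096    # hypothetical LLM hidden size
N_PATCHES = 256   # visual tokens per image / per frame
N_FRAMES = 8      # frames sampled from a video

# Because image and video features live in one aligned space *before*
# projection, one projection matrix serves both modalities.
W_proj = rng.standard_normal((D_VISION, D_MODEL)) * 0.02

def project(visual_tokens: np.ndarray) -> np.ndarray:
    """Map aligned visual tokens into the LLM's token-embedding space."""
    return visual_tokens @ W_proj

# Stand-ins for the aligned encoder outputs of one image and one video.
image_feats = rng.standard_normal((N_PATCHES, D_VISION))
video_feats = rng.standard_normal((N_FRAMES * N_PATCHES, D_VISION))

image_tokens = project(image_feats)   # shape (256, 4096)
video_tokens = project(video_feats)   # shape (2048, 4096)
```

Both outputs land in the same D_MODEL-dimensional space, so the LLM can consume image and video tokens interchangeably; in the misaligned alternative, each modality would need its own projection layer trained against a different input distribution.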

Results

Task | Dataset | Metric | Value | Model
Relation Extraction | Vinoground | Group Score | 6.6 | Video-LLaVA-7B
Relation Extraction | Vinoground | Text Score | 24.8 | Video-LLaVA-7B
Relation Extraction | Vinoground | Video Score | 25.8 | Video-LLaVA-7B
Question Answering | MSVD-QA | Accuracy | 70.7 | Video-LLaVA-7B
Question Answering | MSVD-QA | Confidence Score | 3.9 | Video-LLaVA-7B
Question Answering | TGIF-QA | Accuracy | 70 | Video-LLaVA-7B
Question Answering | TGIF-QA | Confidence Score | 4 | Video-LLaVA-7B
Question Answering | MSRVTT-QA | Accuracy | 59.2 | Video-LLaVA-7B
Question Answering | MSRVTT-QA | Confidence Score | 3.5 | Video-LLaVA-7B
Question Answering | ActivityNet-QA | Accuracy | 45.3 | Video-LLaVA
Question Answering | ActivityNet-QA | Confidence Score | 3.3 | Video-LLaVA
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 32 | Video-LLaVA
Video Question Answering | ActivityNet-QA | Accuracy | 45.3 | Video-LLaVA
Video Question Answering | ActivityNet-QA | Confidence Score | 3.3 | Video-LLaVA
Video Question Answering | MSVD-QA | Accuracy | 70.7 | Video-LLaVA-7B
Video Question Answering | MSVD-QA | Confidence Score | 3.9 | Video-LLaVA-7B
Video Question Answering | TGIF-QA | Accuracy | 70 | Video-LLaVA-7B
Video Question Answering | TGIF-QA | Confidence Score | 4 | Video-LLaVA-7B
Video Question Answering | MSRVTT-QA | Accuracy | 59.2 | Video-LLaVA-7B
Video Question Answering | MSRVTT-QA | Confidence Score | 3.5 | Video-LLaVA-7B
Temporal Relation Extraction | Vinoground | Group Score | 6.6 | Video-LLaVA-7B
Temporal Relation Extraction | Vinoground | Text Score | 24.8 | Video-LLaVA-7B
Temporal Relation Extraction | Vinoground | Video Score | 25.8 | Video-LLaVA-7B
Visual Question Answering | MM-Vet | GPT-4 score | 32 | Video-LLaVA

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)