LinVT: Empower Your Image-level Large Language Model to Understand Videos

Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, Zheng Zhao

2024-12-06

Zero-Shot Video Question Answer, Video Question Answering, Large Language Model, Video Understanding, Language Modelling, Visual Question Answering

Abstract

Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module that transforms arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation, to preserve the original visual-language alignment, and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
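
The two design principles can be illustrated with a small sketch. The abstract does not say how the linear transformation is computed, so the following is only one possible reading, not the authors' implementation: each output token is a weighted average of the input visual tokens, which keeps the output in the same feature space while shrinking the token count. The names `condense_tokens` and `score_rows` are hypothetical, and the scores are assumed to be given.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def condense_tokens(tokens, score_rows):
    """Hypothetical helper: map M input tokens to K output tokens.

    tokens:     M vectors of length D (visual tokens gathered from all frames).
    score_rows: K rows of M raw scores (assumed given; how a real model
                would produce them is outside this sketch).
    Each output token is a weighted average of the input tokens.
    """
    dim = len(tokens[0])
    condensed = []
    for row in score_rows:
        weights = softmax(row)
        condensed.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return condensed


# Four toy 2-D tokens condensed into two tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
score_rows = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
condensed = condense_tokens(tokens, score_rows)
print(condensed[0])  # uniform scores give the plain mean of the inputs
```

For fixed weights the mapping is linear in the inputs: scaling every input token scales every output token by the same factor. Under the assumptions above, that is one way to see why such a module would not distort features the image-LLM already understands.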

Results

Task	Dataset	Metric	Value	Model
Question Answering	MSVD-QA	Accuracy	80.2	LinVT-Qwen2-VL (7B)
Question Answering	MSVD-QA	Confidence Score	4.4	LinVT-Qwen2-VL (7B)
Question Answering	TGIF-QA	Accuracy	81.3	LinVT-Qwen2-VL (7B)
Question Answering	TGIF-QA	Confidence Score	4.3	LinVT-Qwen2-VL (7B)
Question Answering	MSRVTT-QA	Accuracy	66.2	LinVT-Qwen2-VL (7B)
Question Answering	MSRVTT-QA	Confidence Score	4	LinVT-Qwen2-VL (7B)
Question Answering	EgoSchema (fullset)	Accuracy	69.5	LinVT-Qwen2-VL (7B)
Question Answering	ActivityNet-QA	Accuracy	60.1	LinVT-Qwen2-VL (7B)
Question Answering	ActivityNet-QA	Confidence Score	3.6	LinVT-Qwen2-VL (7B)
Visual Question Answering (VQA)	MM-Vet	GPT-4 score	23.5	LinVT
Video Question Answering	NExT-QA	Accuracy	85.5	LinVT-Qwen2-VL (7B)
Video Question Answering	MVBench	Avg.	69.3	LinVT-Qwen2-VL (7B)
Video Question Answering	MSVD-QA	Accuracy	80.2	LinVT-Qwen2-VL (7B)
Video Question Answering	MSVD-QA	Confidence Score	4.4	LinVT-Qwen2-VL (7B)
Video Question Answering	TGIF-QA	Accuracy	81.3	LinVT-Qwen2-VL (7B)
Video Question Answering	TGIF-QA	Confidence Score	4.3	LinVT-Qwen2-VL (7B)
Video Question Answering	MSRVTT-QA	Accuracy	66.2	LinVT-Qwen2-VL (7B)
Video Question Answering	MSRVTT-QA	Confidence Score	4	LinVT-Qwen2-VL (7B)
Video Question Answering	EgoSchema (fullset)	Accuracy	69.5	LinVT-Qwen2-VL (7B)
Video Question Answering	ActivityNet-QA	Accuracy	60.1	LinVT-Qwen2-VL (7B)
Video Question Answering	ActivityNet-QA	Confidence Score	3.6	LinVT-Qwen2-VL (7B)
Visual Question Answering	MM-Vet	GPT-4 score	23.5	LinVT
