Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, Zheng Zhao
Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module that transforms arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment, and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | MSVD-QA | Accuracy | 80.2 | LinVT-Qwen2-VL (7B) |
| Question Answering | MSVD-QA | Confidence Score | 4.4 | LinVT-Qwen2-VL (7B) |
| Question Answering | TGIF-QA | Accuracy | 81.3 | LinVT-Qwen2-VL (7B) |
| Question Answering | TGIF-QA | Confidence Score | 4.3 | LinVT-Qwen2-VL (7B) |
| Question Answering | MSRVTT-QA | Accuracy | 66.2 | LinVT-Qwen2-VL (7B) |
| Question Answering | MSRVTT-QA | Confidence Score | 4 | LinVT-Qwen2-VL (7B) |
| Question Answering | EgoSchema (fullset) | Accuracy | 69.5 | LinVT-Qwen2-VL (7B) |
| Question Answering | ActivityNet-QA | Accuracy | 60.1 | LinVT-Qwen2-VL (7B) |
| Question Answering | ActivityNet-QA | Confidence Score | 3.6 | LinVT-Qwen2-VL (7B) |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 23.5 | LinVT |
| Video Question Answering | NExT-QA | Accuracy | 85.5 | LinVT-Qwen2-VL (7B) |
| Video Question Answering | MVBench | Avg. | 69.3 | LinVT-Qwen2-VL (7B) |
| Video Question Answering | MSVD-QA | Accuracy | 80.2 | LinVT-Qwen2-VL (7B) |
| Video Question Answering | MSVD-QA | Confidence Score | 4.4 | LinVT-Qwen2-VL (7B) |
| Video Question Answering | TGIF-QA | Accuracy | 81.3 | LinVT-Qwen2-VL (7B) |
| Video Question Answering | TGIF-QA | Confidence Score | 4.3 | LinVT-Qwen2-VL (7B) |
| Video Question Answering | MSRVTT-QA | Accuracy | 66.2 | LinVT-Qwen2-VL (7B) |
| Video Question Answering | MSRVTT-QA | Confidence Score | 4 | LinVT-Qwen2-VL (7B) |
| Video Question Answering | EgoSchema (fullset) | Accuracy | 69.5 | LinVT-Qwen2-VL (7B) |
| Video Question Answering | ActivityNet-QA | Accuracy | 60.1 | LinVT-Qwen2-VL (7B) |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.6 | LinVT-Qwen2-VL (7B) |
| Visual Question Answering | MM-Vet | GPT-4 score | 23.5 | LinVT |