Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LinVT: Empower Your Image-level Large Language Model to Understand Videos

Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, Zheng Zhao

2024-12-06 · Zero-Shot Video Question Answer · Video Question Answering · Large Language Model · Video Understanding · Language Modelling · Visual Question Answering

Paper · PDF · Code (official)

Abstract

Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module to transform arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
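To make the two design principles concrete, the following is a minimal NumPy sketch of the general idea, not the authors' implementation: a purely linear scoring head ranks the many tokens produced from video frames and keeps only a small set of representative ones, so the image-LLM's original visual-language alignment is never passed through a nonlinearity. All names (`linear_video_tokenizer`, `W_score`, `k`) are hypothetical.

```python
import numpy as np

def linear_video_tokenizer(video_tokens, W_score, k):
    """Condense many frame-level video tokens into k representative tokens.

    Hypothetical sketch of the paper's two principles:
    (1) linearity only, so the image-LLM's visual-language alignment
        is preserved, and
    (2) condensation: keep a small set of representative tokens chosen
        by a learned linear scoring vector.
    video_tokens: (n_tokens, d) array; W_score: (d,) scoring vector.
    """
    scores = video_tokens @ W_score      # linear saliency score per token
    top = np.argsort(scores)[-k:][::-1]  # indices of the k highest scores
    return video_tokens[top]             # (k, d) condensed token set

rng = np.random.default_rng(0)
tokens = rng.normal(size=(128, 16))      # e.g. 8 frames x 16 tokens, dim 16
w = rng.normal(size=16)
condensed = linear_video_tokenizer(tokens, w, k=8)
print(condensed.shape)  # (8, 16)
```

The condensed tokens can then be fed to the image-LLM in place of the full, redundant video token sequence; in the actual method the selection and aggregation are trained on video data.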

Results

Task | Dataset | Metric | Value | Model
Question Answering | MSVD-QA | Accuracy | 80.2 | LinVT-Qwen2-VL (7B)
Question Answering | MSVD-QA | Confidence Score | 4.4 | LinVT-Qwen2-VL (7B)
Question Answering | TGIF-QA | Accuracy | 81.3 | LinVT-Qwen2-VL (7B)
Question Answering | TGIF-QA | Confidence Score | 4.3 | LinVT-Qwen2-VL (7B)
Question Answering | MSRVTT-QA | Accuracy | 66.2 | LinVT-Qwen2-VL (7B)
Question Answering | MSRVTT-QA | Confidence Score | 4 | LinVT-Qwen2-VL (7B)
Question Answering | EgoSchema (fullset) | Accuracy | 69.5 | LinVT-Qwen2-VL (7B)
Question Answering | ActivityNet-QA | Accuracy | 60.1 | LinVT-Qwen2-VL (7B)
Question Answering | ActivityNet-QA | Confidence Score | 3.6 | LinVT-Qwen2-VL (7B)
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 23.5 | LinVT
Video Question Answering | NExT-QA | Accuracy | 85.5 | LinVT-Qwen2-VL (7B)
Video Question Answering | MVBench | Avg. | 69.3 | LinVT-Qwen2-VL (7B)
Video Question Answering | MSVD-QA | Accuracy | 80.2 | LinVT-Qwen2-VL (7B)
Video Question Answering | MSVD-QA | Confidence Score | 4.4 | LinVT-Qwen2-VL (7B)
Video Question Answering | TGIF-QA | Accuracy | 81.3 | LinVT-Qwen2-VL (7B)
Video Question Answering | TGIF-QA | Confidence Score | 4.3 | LinVT-Qwen2-VL (7B)
Video Question Answering | MSRVTT-QA | Accuracy | 66.2 | LinVT-Qwen2-VL (7B)
Video Question Answering | MSRVTT-QA | Confidence Score | 4 | LinVT-Qwen2-VL (7B)
Video Question Answering | EgoSchema (fullset) | Accuracy | 69.5 | LinVT-Qwen2-VL (7B)
Video Question Answering | ActivityNet-QA | Accuracy | 60.1 | LinVT-Qwen2-VL (7B)
Video Question Answering | ActivityNet-QA | Confidence Score | 3.6 | LinVT-Qwen2-VL (7B)
Visual Question Answering | MM-Vet | GPT-4 score | 23.5 | LinVT

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)