Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan

2023-11-16

Tasks: Zero-Shot Video Question Answer, Question Answering, Video Question Answering, Large Language Model, Visual Question Answering (VQA), Temporal Relation Extraction, Language Modelling, Multiple-choice, Visual Question Answering
Paper · PDF · Code (official)

Abstract

The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM. Code address: https://github.com/PKU-YuanGroup/Video-LLaVA
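The core idea of "alignment before projection" is that images and videos are encoded into one already-aligned visual feature space, so a single shared projection layer can map both modalities into the LLM's embedding space. The following NumPy sketch illustrates only that structural point; all dimensions, token counts, and the random linear projection are illustrative assumptions, not the paper's actual architecture or weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration (not the paper's real dimensions).
D_VISION = 1024   # width of the shared, pre-aligned visual encoder output
D_MODEL = 4096    # hypothetical LLM hidden size
N_PATCHES = 256   # visual tokens per image / per frame
N_FRAMES = 8      # frames sampled from a video

# Because image and video features live in one aligned space *before*
# projection, one projection matrix serves both modalities.
W_proj = rng.standard_normal((D_VISION, D_MODEL)) * 0.02

def project(visual_tokens: np.ndarray) -> np.ndarray:
    """Map aligned visual tokens into the LLM's token-embedding space."""
    return visual_tokens @ W_proj

# Stand-ins for the aligned encoder outputs of one image and one video.
image_feats = rng.standard_normal((N_PATCHES, D_VISION))
video_feats = rng.standard_normal((N_FRAMES * N_PATCHES, D_VISION))

image_tokens = project(image_feats)   # shape (256, 4096)
video_tokens = project(video_feats)   # shape (2048, 4096)
```

Both outputs land in the same D_MODEL-dimensional space, so the LLM can consume image and video tokens interchangeably; in the misaligned alternative, each modality would need its own projection layer trained against a different input distribution.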

Results

Task | Dataset | Metric | Value | Model
Relation Extraction | Vinoground | Group Score | 6.6 | Video-LLaVA-7B
Relation Extraction | Vinoground | Text Score | 24.8 | Video-LLaVA-7B
Relation Extraction | Vinoground | Video Score | 25.8 | Video-LLaVA-7B
Question Answering | MSVD-QA | Accuracy | 70.7 | Video-LLaVA-7B
Question Answering | MSVD-QA | Confidence Score | 3.9 | Video-LLaVA-7B
Question Answering | TGIF-QA | Accuracy | 70 | Video-LLaVA-7B
Question Answering | TGIF-QA | Confidence Score | 4 | Video-LLaVA-7B
Question Answering | MSRVTT-QA | Accuracy | 59.2 | Video-LLaVA-7B
Question Answering | MSRVTT-QA | Confidence Score | 3.5 | Video-LLaVA-7B
Question Answering | ActivityNet-QA | Accuracy | 45.3 | Video-LLaVA
Question Answering | ActivityNet-QA | Confidence Score | 3.3 | Video-LLaVA
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 32 | Video-LLaVA
Video Question Answering | ActivityNet-QA | Accuracy | 45.3 | Video-LLaVA
Video Question Answering | ActivityNet-QA | Confidence Score | 3.3 | Video-LLaVA
Video Question Answering | MSVD-QA | Accuracy | 70.7 | Video-LLaVA-7B
Video Question Answering | MSVD-QA | Confidence Score | 3.9 | Video-LLaVA-7B
Video Question Answering | TGIF-QA | Accuracy | 70 | Video-LLaVA-7B
Video Question Answering | TGIF-QA | Confidence Score | 4 | Video-LLaVA-7B
Video Question Answering | MSRVTT-QA | Accuracy | 59.2 | Video-LLaVA-7B
Video Question Answering | MSRVTT-QA | Confidence Score | 3.5 | Video-LLaVA-7B
Temporal Relation Extraction | Vinoground | Group Score | 6.6 | Video-LLaVA-7B
Temporal Relation Extraction | Vinoground | Text Score | 24.8 | Video-LLaVA-7B
Temporal Relation Extraction | Vinoground | Video Score | 25.8 | Video-LLaVA-7B
Visual Question Answering | MM-Vet | GPT-4 score | 32 | Video-LLaVA

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)