
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu

2024-02-05
Tasks: Zero-Shot Video Question Answer, Text-to-Video Generation, Science Question Answering, Visual Question Answering (VQA), Visual Question Answering, Video Generation

Abstract

In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training due to the modeling of its spatiotemporal dynamics. In this paper, we address such limitations in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information as a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the generated tokens from the LLM are carefully recovered to the original continuous pixel space to create various video content. Our proposed framework is capable of both comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation. Our code and models are available at https://video-lavit.github.io.
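
The decomposition idea in the abstract (a keyframe plus temporal motion per clip) can be illustrated with a toy sketch. This is not the authors' method: plain per-pixel frame differences stand in for the motion signal, frames are small grayscale grids, and the names `decompose`, `reconstruct`, and `clip_len` are hypothetical. It also omits the tokenizers that discretize each part for the LLM, and only shows that a clip can be stored as one keyframe plus a sequence of motions and recovered afterwards.

```python
from typing import List, Tuple

# A grayscale frame as rows of pixel intensities (toy representation).
Frame = List[List[int]]
Clip = Tuple[Frame, List[Frame]]  # (keyframe, motions)


def frame_diff(a: Frame, b: Frame) -> Frame:
    """Per-pixel difference b - a, used here as a stand-in for motion."""
    return [[pb - pa for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def frame_add(a: Frame, d: Frame) -> Frame:
    """Apply a per-pixel difference d to frame a."""
    return [[pa + pd for pa, pd in zip(ra, rd)] for ra, rd in zip(a, d)]


def decompose(video: List[Frame], clip_len: int) -> List[Clip]:
    """Split a video into clips; keep each clip's first frame as the
    keyframe and represent the remaining frames as successive differences."""
    if clip_len < 1:
        raise ValueError("clip_len must be at least 1")
    clips: List[Clip] = []
    for start in range(0, len(video), clip_len):
        clip = video[start:start + clip_len]
        keyframe = clip[0]
        motions = [frame_diff(prev, cur) for prev, cur in zip(clip, clip[1:])]
        clips.append((keyframe, motions))
    return clips


def reconstruct(clips: List[Clip]) -> List[Frame]:
    """Recover the frames by accumulating motions onto each keyframe."""
    video: List[Frame] = []
    for keyframe, motions in clips:
        current = keyframe
        video.append(current)
        for motion in motions:
            current = frame_add(current, motion)
            video.append(current)
    return video


if __name__ == "__main__":
    # Five synthetic 2x2 frames, split into clips of two frames each.
    video = [[[t, 2 * t], [0, t * t]] for t in range(5)]
    clips = decompose(video, clip_len=2)
    print(len(clips), reconstruct(clips) == video)
```

In this toy version the round trip is exact because nothing is quantized. A real system would compress the keyframe and the motions into a small number of discrete tokens, so its reconstruction would only be approximate.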

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | UCF-101 | FVD16 | 280.57 | Video-LaVIT |
| Video | UCF-101 | Inception Score | 44.26 | Video-LaVIT |
| Question Answering | MSVD-QA | Accuracy | 73.2 | Video-LaVIT |
| Question Answering | MSVD-QA | Confidence Score | 3.9 | Video-LaVIT |
| Question Answering | MSRVTT-QA | Accuracy | 59.3 | Video-LaVIT |
| Question Answering | MSRVTT-QA | Confidence Score | 3.3 | Video-LaVIT |
| Question Answering | ActivityNet-QA | Accuracy | 50.1 | Video-LaVIT |
| Question Answering | ActivityNet-QA | Confidence Score | 3.3 | Video-LaVIT |
| Question Answering | ScienceQA | Avg. Accuracy | 70 | Video-LaVIT |
| Visual Question Answering (VQA) | VizWiz 2020 VQA | overall | 56 | Video-LaVIT |
| Visual Question Answering (VQA) | GQA test-dev | Accuracy | 64.4 | Video-LaVIT |
| Visual Question Answering (VQA) | MMBench | GPT-3.5 score | 67.3 | Video-LaVIT |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 33.2 | Video-LaVIT |
| Video Question Answering | MSVD-QA | Accuracy | 73.2 | Video-LaVIT |
| Video Question Answering | MSVD-QA | Confidence Score | 3.9 | Video-LaVIT |
| Video Question Answering | MSRVTT-QA | Accuracy | 59.3 | Video-LaVIT |
| Video Question Answering | MSRVTT-QA | Confidence Score | 3.3 | Video-LaVIT |
| Video Question Answering | ActivityNet-QA | Accuracy | 50.1 | Video-LaVIT |
| Video Question Answering | ActivityNet-QA | Confidence Score | 3.3 | Video-LaVIT |
| Video Generation | UCF-101 | FVD16 | 280.57 | Video-LaVIT |
| Video Generation | UCF-101 | Inception Score | 44.26 | Video-LaVIT |
| Text-to-Video Generation | MSR-VTT | CLIPSIM | 0.3012 | Video-LaVIT |
| Text-to-Video Generation | MSR-VTT | FID | 11.27 | Video-LaVIT |
| Text-to-Video Generation | MSR-VTT | FVD | 188.36 | Video-LaVIT |
| Visual Question Answering | MMBench | GPT-3.5 score | 67.3 | Video-LaVIT |
| Visual Question Answering | MM-Vet | GPT-4 score | 33.2 | Video-LaVIT |

Related Papers

LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)