
VILA: On Pre-training for Visual Language Models

Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han

Published: 2023-12-12 · CVPR 2024
Tasks: Zero-Shot Video Question Answer · World Knowledge · Language Modelling · Visual Question Answering
Links: Paper · PDF · Code (official)

Abstract

Visual language models (VLMs) have progressed rapidly with the recent success of large language models (LLMs). There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but these efforts lack an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-training by augmenting an LLM toward a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing the LLM during pre-training can achieve decent zero-shot performance but lacks in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial, whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data into image-text data during instruction fine-tuning not only remedies the degradation of text-only tasks but also boosts VLM task accuracy. With this enhanced pre-training recipe we build VILA, a Visual Language model family that consistently outperforms state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge.
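The abstract describes a training recipe rather than an implementation. The following is a minimal, hypothetical PyTorch-style sketch of two of the findings: unfreezing the LLM during visual-language pre-training (finding 1) and re-blending text-only instruction data into the image-text fine-tuning mix (finding 3). The model classes, field names, and the 25% blend ratio are illustrative assumptions, not details from the paper or its code release.

```python
# Sketch of two VILA recipe ingredients, based only on the abstract:
#   (1) unfreeze the LLM during visual-language pre-training (frozen LLMs give
#       decent zero-shot numbers but weaker in-context learning), and
#   (3) re-blend text-only instruction data into the image-text SFT mix.
# All modules, field names, and ratios below are assumptions for illustration.
import random
import torch.nn as nn


class ToyVLM(nn.Module):
    """Stand-in VLM: vision encoder -> projector -> LLM (all placeholders)."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(768, 512)   # placeholder vision tower
        self.projector = nn.Linear(512, 512)        # image-token projector
        self.llm = nn.Linear(512, 512)              # placeholder language model


def set_pretraining_trainable(model: ToyVLM, unfreeze_llm: bool = True) -> None:
    """Finding (1): train the projector and, crucially, the LLM itself."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False                     # vision tower stays frozen
    for p in model.projector.parameters():
        p.requires_grad = True
    for p in model.llm.parameters():
        p.requires_grad = unfreeze_llm              # unfreezing enables in-context learning


def blend_sft_data(image_text, text_only, text_fraction=0.25, seed=0):
    """Finding (3): mix text-only instruction samples back into the SFT set.

    `text_fraction` is an assumed ratio, not a number from the paper.
    """
    rng = random.Random(seed)
    n_text = int(len(image_text) * text_fraction)
    mix = list(image_text) + rng.sample(list(text_only), min(n_text, len(text_only)))
    rng.shuffle(mix)
    return mix


model = ToyVLM()
set_pretraining_trainable(model, unfreeze_llm=True)
sft_set = blend_sft_data(image_text=[{"img": i} for i in range(100)],
                         text_only=[{"txt": i} for i in range(50)])
print(len(sft_set))  # 125 samples: 100 image-text + 25 text-only
```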

Results

Task                            | Dataset              | Metric       | Value | Model
Video Question Answering       | Video-MME (w/o subs) | Accuracy (%) | 61.4  | VILA-1.5 (34B)
Video Question Answering       | Video-MME            | Accuracy (%) | 64.1  | VILA-1.5 (34B)
Video Question Answering       | MSVD-QA              | Accuracy     | 80.1  | VILA1.5-40B
Visual Question Answering (VQA)| MM-Vet               | GPT-4 score  | 45.7  | VILA-13B

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
- Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)