NVILA: Efficient Frontier Visual Language Models

Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu

2024-12-05 | CVPR 2025 | Video Question Answering


Abstract

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.
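The "scale-then-compress" idea can be illustrated with a small sketch. The abstract does not say which compression operator NVILA uses, so the 2x2 average pooling below is only a stand-in, and the patch size (14), the resolutions (448 and 896), and the helper names `token_count` and `pool_tokens_2x2` are hypothetical values chosen for illustration. The sketch shows how doubling the spatial resolution quadruples the visual token count, and how pooling neighbouring tokens brings it back down.

```python
def token_count(height, width, patch):
    """Number of visual tokens when an image is split into patch x patch tiles."""
    return (height // patch) * (width // patch)


def pool_tokens_2x2(grid):
    """Average each 2x2 block of a token grid (rows x cols x dim) into one token.

    This is an assumed, simplified compression step, not the paper's operator.
    """
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - rows % 2, 2):
        out_row = []
        for c in range(0, cols - cols % 2, 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            # Average the four token vectors element by element.
            out_row.append([sum(vals) / 4.0 for vals in zip(*block)])
        pooled.append(out_row)
    return pooled


# Hypothetical sizes, chosen only for illustration.
base = token_count(448, 448, 14)    # tokens at the base resolution
scaled = token_count(896, 896, 14)  # tokens after scaling resolution up 2x

# A toy 4x4 grid of 1-dimensional tokens, compressed to 2x2.
grid = [[[float(r + c)] for c in range(4)] for r in range(4)]
compressed = pool_tokens_2x2(grid)

print(base, scaled, len(compressed) * len(compressed[0]))
```

In this toy setup the scaled image yields four times as many tokens as the base image, and one pooling pass reduces a grid's token count by a factor of four, so the language model sees higher-resolution information at roughly the original sequence length.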

Results

Task	Dataset	Metric	Value	Model
Video Question Answering	NExT-QA	Accuracy	82.2	NVILA(8B)
