Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang

2024-03-18 · Long-Context Understanding
Paper · PDF · Code (official)

Abstract

Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) an image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports images of 6 times higher resolution (i.e., 672x1088) using only 94% of the inference computation, and achieves a 6.4-point accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.
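The image modularization idea from the abstract can be illustrated with a minimal sketch: choose a grid of patch-sized slices whose total count and aspect ratio best match the native image, then cut the image into variable-sized slices along that grid. The function names (`choose_slice_grid`, `slice_image`), the scoring function, and the `max_slices` budget below are all hypothetical illustrations, not the paper's actual algorithm.

```python
import math

def choose_slice_grid(width, height, patch=336, max_slices=6):
    """Pick a (cols, rows) grid whose slice count and aspect ratio
    best match the native image. Hypothetical scoring: the real
    LLaVA-UHD criterion may differ.
    """
    ideal = (width * height) / (patch * patch)  # ideal number of patch-sized slices
    aspect = width / height
    best, best_err = (1, 1), float("inf")
    for n in range(1, max_slices + 1):
        for cols in range(1, n + 1):
            if n % cols:
                continue  # only grids with cols * rows == n
            rows = n // cols
            # Penalize deviation from the ideal slice count and from
            # the native aspect ratio (log ratio is symmetric).
            err = abs(n - ideal) + abs(math.log((cols / rows) / aspect))
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def slice_image(width, height, cols, rows):
    """Return pixel boxes (left, top, right, bottom) for each
    variable-sized slice of the chosen grid."""
    return [
        (width * c // cols, height * r // rows,
         width * (c + 1) // cols, height * (r + 1) // rows)
        for r in range(rows) for c in range(cols)
    ]
```

Under this sketch, a 672x1088 image (the resolution highlighted in the abstract) maps to a 2x3 grid of six roughly 336-pixel slices, which each slice then being encoded independently before compression.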

Results

Task | Dataset | Metric | Value | Model
Long-Context Understanding | MMNeedle | 1 Image, 2*2 Stitching, Exact Accuracy | 43.8 | LLaVA-Llama-3
Long-Context Understanding | MMNeedle | 1 Image, 4*4 Stitching, Exact Accuracy | 17.5 | LLaVA-Llama-3
Long-Context Understanding | MMNeedle | 1 Image, 8*8 Stitching, Exact Accuracy | 3.3 | LLaVA-Llama-3

Related Papers

Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models (2025-07-13)
Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs? (2025-06-20)
PaceLLM: Brain-Inspired Large Language Models for Long-Context Understanding (2025-06-18)
DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration (2025-06-06)
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training (2025-06-05)
ATLAS: Learning to Optimally Memorize the Context at Test Time (2025-05-29)
SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences (2025-05-27)
MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models (2025-05-26)