CROME: Cross-Modal Adapters for Efficient Multimodal LLM

Sayna Ebrahimi, Sercan O. Arik, Tejas Nama, Tomas Pfister

2024-08-13

Question Answering, Instruction Following, Language Modelling, Visual Question Answering

Abstract

Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities, but their widespread use faces challenges in cost-effective training and adaptation. Existing approaches often require expensive language model retraining and offer limited adaptability. Additionally, the current focus on zero-shot performance improvements offers insufficient guidance for task-specific tuning. We propose CROME, an efficient vision-language instruction tuning framework. It features a novel gated cross-modal adapter that combines visual and textual representations before they are passed to a frozen LLM. This lightweight adapter, trained with minimal parameters, enables efficient cross-modal understanding. Notably, CROME demonstrates superior zero-shot performance on standard visual question answering and instruction-following benchmarks. Moreover, it supports fine-tuning with exceptional parameter efficiency, competing with task-specific, specialist state-of-the-art methods. CROME demonstrates the potential of pre-LM alignment for building scalable, adaptable, and parameter-efficient multimodal models.
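To make the idea of a gated adapter placed in front of a frozen LLM more concrete, here is a minimal, dependency-free sketch. It is not the CROME implementation: the abstract does not specify the adapter's layers, so the gated bottleneck form (`token + up(silu(gate(token)) * down(token))`), the zero-initialized up-projection, and all names (`GatedAdapter`, `build_llm_input`, `trainable_params`) are assumptions chosen for illustration only.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a (rows x cols) matrix, stored as nested lists, by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


class GatedAdapter:
    """Hypothetical gated bottleneck adapter (an assumed form, not the paper's)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        self.dim = dim
        self.bottleneck = bottleneck
        self.w_down = init(bottleneck, dim)   # down-projection into the bottleneck
        self.w_gate = init(bottleneck, dim)   # gate branch
        # Zero-initialized up-projection: the adapter starts as the identity map.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def trainable_params(self):
        """Number of adapter weights; the LLM itself would stay frozen."""
        return 3 * self.dim * self.bottleneck

    def __call__(self, token):
        down = matvec(self.w_down, token)
        gate = matvec(self.w_gate, token)
        hidden = [silu(g) * d for g, d in zip(gate, down)]
        update = matvec(self.w_up, hidden)
        return [t + u for t, u in zip(token, update)]  # residual connection


def build_llm_input(image_tokens, text_tokens, adapter):
    """Concatenate visual and text tokens, then adapt each one before the frozen LLM."""
    return [adapter(tok) for tok in image_tokens + text_tokens]


adapter = GatedAdapter(dim=4, bottleneck=2)
image_tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
text_tokens = [[0.0, 1.0, 0.0, 1.0]]
llm_input = build_llm_input(image_tokens, text_tokens, adapter)
```

In this sketch only the three small adapter matrices would be trained, which is one way a "pre-LM" module can stay lightweight: with `dim=4` and `bottleneck=2` that is 24 weights, and the zero-initialized up-projection means the frozen LLM initially sees the unmodified token sequence.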

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	MM-Vet	GPT-4 score	55.1	CROME (Vicuna-13B)
