Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MMRL: Multi-Modal Representation Learning for Vision-Language Models

Yuncheng Guo, Xiaodong Gu

2025-03-11 · CVPR 2025 · Representation Learning · Prompt Engineering · Transfer Learning
Paper · PDF · Code (official)

Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have become essential for transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, diminishing their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. MMRL projects the space tokens to text and image representation tokens, facilitating more effective multi-modal interactions. Unlike previous approaches that solely optimize class token features, MMRL integrates representation tokens at higher layers of the encoders--where dataset-specific features are more prominent--while preserving generalized knowledge in the lower layers. During training, both representation and class features are optimized, with a trainable projection layer applied to the representation tokens, whereas the class token projection layer remains frozen to retain pre-trained knowledge. Furthermore, a regularization term is introduced to align the class and text features with the zero-shot features from the frozen VLM, thereby safeguarding the model's generalization capacity. For inference, a decoupling strategy is employed: both representation and class features are utilized for base classes, while only the class features, which retain more generalized knowledge, are used for new tasks. Extensive experiments across 15 datasets demonstrate that MMRL outperforms state-of-the-art methods, achieving a balanced trade-off between task-specific adaptation and generalization. Code is available at https://github.com/yunncheng/MMRL.
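The abstract's decoupled inference and alignment regularization can be illustrated with a minimal numpy sketch. This is not the authors' implementation (see the linked repository for that); the function names, the mixing weight `alpha`, and the cosine-distance form of the penalty are illustrative assumptions consistent with the description above.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # CLIP-style features are compared after L2 normalization
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def mmrl_logits(rep_feat, cls_feat, text_feat, base=True, alpha=0.5):
    """Decoupled inference sketch (names and alpha are hypothetical).

    Base classes: combine representation-token and class-token features.
    New classes:  class features only, since they retain more of the
                  pre-trained, generalized knowledge.
    """
    rep_feat = l2_normalize(rep_feat)
    cls_feat = l2_normalize(cls_feat)
    text_feat = l2_normalize(text_feat)
    cls_logits = cls_feat @ text_feat.T          # (batch, num_classes)
    if not base:
        return cls_logits
    rep_logits = rep_feat @ text_feat.T
    return alpha * rep_logits + (1 - alpha) * cls_logits

def alignment_penalty(feat, zs_feat):
    # Regularization sketch: keep adapted features close (in cosine
    # similarity) to the frozen VLM's zero-shot features
    feat, zs_feat = l2_normalize(feat), l2_normalize(zs_feat)
    return float(np.mean(1.0 - np.sum(feat * zs_feat, axis=-1)))
```

For example, calling `mmrl_logits(..., base=False)` on a novel-class dataset ignores the adapted representation branch entirely, which is how the decoupling preserves generalization.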

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Prompt Engineering | ImageNet-R | Top-1 accuracy % | 77.53 | MMRL |
| Prompt Engineering | Stanford Cars | Harmonic mean | 78.06 | MMRL |
| Prompt Engineering | Oxford 102 Flower | Harmonic mean | 86.78 | MMRL |
| Prompt Engineering | EuroSAT | Harmonic mean | 87.21 | MMRL |
| Prompt Engineering | Oxford-IIIT Pet Dataset | Harmonic mean | 96.74 | MMRL |
| Prompt Engineering | ImageNet-S | Top-1 accuracy % | 49.17 | MMRL |
| Prompt Engineering | DTD | Harmonic mean | 73.82 | MMRL |
| Prompt Engineering | UCF101 | Harmonic mean | 83.89 | MMRL |
| Prompt Engineering | Food-101 | Harmonic mean | 91.03 | MMRL |
| Prompt Engineering | Caltech-101 | Harmonic mean | 96.68 | MMRL |
| Prompt Engineering | ImageNet | Harmonic mean | 74.45 | MMRL |
| Prompt Engineering | FGVC-Aircraft | Harmonic mean | 41.15 | MMRL |
| Prompt Engineering | SUN397 | Harmonic mean | 81.2 | MMRL |
| Prompt Engineering | ImageNet-A | Top-1 accuracy % | 51.2 | MMRL |
| Prompt Engineering | ImageNet V2 | Top-1 accuracy % | 64.47 | MMRL |

Related Papers

- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- Leveraging Language Prior for Infrared Small Target Detection (2025-07-17)
- Emotional Support with LLM-based Empathetic Dialogue Generation (2025-07-17)
- Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
- Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)