Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MMRL: Multi-Modal Representation Learning for Vision-Language Models

Yuncheng Guo, Xiaodong Gu

2025-03-11 · CVPR 2025 · Representation Learning · Prompt Engineering · Transfer Learning
Paper · PDF · Code (official)

Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have become essential for transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, diminishing their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. MMRL projects the space tokens to text and image representation tokens, facilitating more effective multi-modal interactions. Unlike previous approaches that solely optimize class token features, MMRL integrates representation tokens at higher layers of the encoders--where dataset-specific features are more prominent--while preserving generalized knowledge in the lower layers. During training, both representation and class features are optimized, with a trainable projection layer applied to the representation tokens, whereas the class token projection layer remains frozen to retain pre-trained knowledge. Furthermore, a regularization term is introduced to align the class and text features with the zero-shot features from the frozen VLM, thereby safeguarding the model's generalization capacity. For inference, a decoupling strategy is employed: both representation and class features are utilized for base classes, while only the class features, which retain more generalized knowledge, are used for new tasks. Extensive experiments across 15 datasets demonstrate that MMRL outperforms state-of-the-art methods, achieving a balanced trade-off between task-specific adaptation and generalization. Code is available at https://github.com/yunncheng/MMRL.
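The abstract's decoupled inference and alignment regularization can be illustrated with a minimal numpy sketch. This is not the authors' implementation (see the linked repository for that); the function names, the mixing weight `alpha`, and the cosine-distance form of the penalty are illustrative assumptions consistent with the description above.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # CLIP-style features are compared after L2 normalization
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def mmrl_logits(rep_feat, cls_feat, text_feat, base=True, alpha=0.5):
    """Decoupled inference sketch (names and alpha are hypothetical).

    Base classes: combine representation-token and class-token features.
    New classes:  class features only, since they retain more of the
                  pre-trained, generalized knowledge.
    """
    rep_feat = l2_normalize(rep_feat)
    cls_feat = l2_normalize(cls_feat)
    text_feat = l2_normalize(text_feat)
    cls_logits = cls_feat @ text_feat.T          # (batch, num_classes)
    if not base:
        return cls_logits
    rep_logits = rep_feat @ text_feat.T
    return alpha * rep_logits + (1 - alpha) * cls_logits

def alignment_penalty(feat, zs_feat):
    # Regularization sketch: keep adapted features close (in cosine
    # similarity) to the frozen VLM's zero-shot features
    feat, zs_feat = l2_normalize(feat), l2_normalize(zs_feat)
    return float(np.mean(1.0 - np.sum(feat * zs_feat, axis=-1)))
```

For example, calling `mmrl_logits(..., base=False)` on a novel-class dataset ignores the adapted representation branch entirely, which is how the decoupling preserves generalization.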

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Prompt Engineering | ImageNet-R | Top-1 accuracy % | 77.53 | MMRL |
| Prompt Engineering | Stanford Cars | Harmonic mean | 78.06 | MMRL |
| Prompt Engineering | Oxford 102 Flower | Harmonic mean | 86.78 | MMRL |
| Prompt Engineering | EuroSAT | Harmonic mean | 87.21 | MMRL |
| Prompt Engineering | Oxford-IIIT Pet Dataset | Harmonic mean | 96.74 | MMRL |
| Prompt Engineering | ImageNet-S | Top-1 accuracy % | 49.17 | MMRL |
| Prompt Engineering | DTD | Harmonic mean | 73.82 | MMRL |
| Prompt Engineering | UCF101 | Harmonic mean | 83.89 | MMRL |
| Prompt Engineering | Food-101 | Harmonic mean | 91.03 | MMRL |
| Prompt Engineering | Caltech-101 | Harmonic mean | 96.68 | MMRL |
| Prompt Engineering | ImageNet | Harmonic mean | 74.45 | MMRL |
| Prompt Engineering | FGVC-Aircraft | Harmonic mean | 41.15 | MMRL |
| Prompt Engineering | SUN397 | Harmonic mean | 81.2 | MMRL |
| Prompt Engineering | ImageNet-A | Top-1 accuracy % | 51.2 | MMRL |
| Prompt Engineering | ImageNet V2 | Top-1 accuracy % | 64.47 | MMRL |

Related Papers

- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- Leveraging Language Prior for Infrared Small Target Detection (2025-07-17)
- Emotional Support with LLM-based Empathetic Dialogue Generation (2025-07-17)
- Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
- Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)