Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models

Yuncheng Guo, Xiaodong Gu

2025-05-15 · Representation Learning · Prompt Engineering · Transfer Learning · General Knowledge

Paper · PDF · Code (official)

Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have significantly advanced transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, undermining their ability to generalize to new tasks. To address this, we propose Multi-Modal Representation Learning (MMRL), which introduces a shared, learnable, modality-agnostic representation space. MMRL generates space tokens projected into both text and image encoders as representation tokens, enabling more effective cross-modal interactions. Unlike prior methods that mainly optimize class token features, MMRL inserts representation tokens into higher encoder layers--where task-specific features are more prominent--while preserving general knowledge in the lower layers. During training, both class and representation features are jointly optimized: a trainable projection layer is applied to representation tokens for task adaptation, while the projection layer for the class token remains frozen to retain pre-trained knowledge. To further promote generalization, we introduce a regularization term aligning class and text features with the frozen VLM's zero-shot features. At inference, a decoupling strategy uses both class and representation features for base tasks, but only class features for novel tasks due to their stronger generalization. Building upon this, we propose MMRL++, a parameter-efficient and interaction-aware extension that significantly reduces trainable parameters and enhances intra-modal interactions--particularly across the layers of representation tokens--allowing gradient sharing and instance-specific information to propagate more effectively through the network. Extensive experiments on 15 datasets demonstrate that MMRL and MMRL++ consistently outperform state-of-the-art methods, achieving a strong balance between task-specific adaptation and generalization.
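The decoupled inference strategy in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `decoupled_inference` and the mixing weight `alpha` are assumptions for the sketch; the abstract only states that base tasks use both feature streams while novel tasks use class features alone.

```python
import numpy as np

def decoupled_inference(class_logits, repr_logits, is_base_task, alpha=0.5):
    """Sketch of MMRL's decoupled inference (assumed combination rule).

    Base tasks: blend predictions from class features and representation
    features. Novel tasks: use class-feature predictions only, since the
    abstract notes they generalize better. `alpha` is a hypothetical
    mixing weight, not a value from the paper.
    """
    class_logits = np.asarray(class_logits, dtype=float)
    repr_logits = np.asarray(repr_logits, dtype=float)
    if is_base_task:
        return alpha * class_logits + (1.0 - alpha) * repr_logits
    return class_logits
```

For a base-class query the two logit vectors are averaged (with `alpha=0.5`); for a novel-class query the representation-feature logits are ignored entirely.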

Results

Task                 Dataset                   Metric         Value   Model
Prompt Engineering   Stanford Cars             Harmonic mean  78.18   MMRL++
Prompt Engineering   Oxford 102 Flower         Harmonic mean  87.01   MMRL++
Prompt Engineering   EuroSAT                   Harmonic mean  91.94   MMRL++
Prompt Engineering   Oxford-IIIT Pet Dataset   Harmonic mean  96.51   MMRL++
Prompt Engineering   DTD                       Harmonic mean  74.46   MMRL++
Prompt Engineering   UCF101                    Harmonic mean  83.81   MMRL++
Prompt Engineering   Food-101                  Harmonic mean  91.1    MMRL++
Prompt Engineering   Caltech-101               Harmonic mean  96.75   MMRL++
Prompt Engineering   ImageNet                  Harmonic mean  74.44   MMRL++
Prompt Engineering   FGVC-Aircraft             Harmonic mean  42.24   MMRL++
Prompt Engineering   SUN397                    Harmonic mean  81.28   MMRL++

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
Leveraging Language Prior for Infrared Small Target Detection (2025-07-17)
Emotional Support with LLM-based Empathetic Dialogue Generation (2025-07-17)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)