Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning

Zhengyang Liang, Meiyu Liang, Wei Huang, Yawen Li, Zhe Xue

2024-04-16 · Cross-Modal Retrieval · Representation Learning

Paper · PDF · Code (official)

Abstract

In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. To address this challenge, we propose, for the first time, a novel dynamic self-adaptive multiscale distillation framework that distills knowledge from a pre-trained multimodal large model for efficient cross-modal representation learning. Unlike existing distillation methods, our strategy employs a multiscale perspective, enabling the extraction of structural knowledge from the pre-trained multimodal large model and ensuring that the student model inherits a comprehensive and nuanced understanding of the teacher's knowledge. To optimize each distillation loss in a balanced and efficient manner, we propose a dynamic self-adaptive distillation loss balancer, a novel component that eliminates the need for manual loss weight adjustment and dynamically balances each loss term during the distillation process. Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information, requiring minimal computational resources. This efficient approach is suited to various applications and allows the deployment of advanced multimodal technologies even in resource-limited settings. Extensive experiments have demonstrated that our method maintains high performance while significantly reducing model complexity and training costs. Moreover, our distilled student model utilizes only image-level information to achieve state-of-the-art performance on cross-modal retrieval tasks, surpassing previous methods that relied on region-level information.
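The abstract describes a dynamic self-adaptive distillation loss balancer that removes manual loss-weight tuning, but gives no equations. The snippet below is only a minimal sketch of one common way to realize such balancing, normalizing each loss term by an exponential moving average of its own magnitude; the class name, the EMA scheme, and the hyperparameters are assumptions for illustration, not the authors' actual method.

```python
class DynamicLossBalancer:
    """Hypothetical sketch: rescale each loss by the inverse of its running
    (EMA) magnitude so no single term dominates and no manual weights are
    needed. This illustrates the balancing idea only; it is not the paper's
    exact formulation.
    """

    def __init__(self, num_losses, momentum=0.9, eps=1e-8):
        self.running = [None] * num_losses  # EMA of each loss's magnitude
        self.momentum = momentum
        self.eps = eps

    def __call__(self, losses):
        assert len(losses) == len(self.running)
        total = 0.0
        for i, loss in enumerate(losses):
            # Track each loss's scale without letting the statistic join
            # backprop (works for autograd tensors or plain floats).
            value = float(loss.detach()) if hasattr(loss, "detach") else float(loss)
            if self.running[i] is None:
                self.running[i] = value
            else:
                self.running[i] = (self.momentum * self.running[i]
                                   + (1 - self.momentum) * value)
            # Dividing by the running scale puts every term near 1.0,
            # balancing contributions to the combined objective.
            total = total + loss / (self.running[i] + self.eps)
        return total / len(losses)
```

In a distillation loop, the per-scale losses would be passed together, e.g. `balancer([feature_loss, logit_loss, contrastive_loss])`, and the returned scalar backpropagated as the total objective.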

Results

Task | Dataset | Metric | Value | Model
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 82.5 | DSMD
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 97.7 | DSMD
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 95.5 | DSMD
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 68.4 | DSMD
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 94.4 | DSMD
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 90.8 | DSMD
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 48 | DSMD
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 84.5 | DSMD
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 75.6 | DSMD
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 62.1 | DSMD
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 92 | DSMD
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 85.9 | DSMD
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@1 | 82.5 | DSMD
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@10 | 97.7 | DSMD
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@5 | 95.5 | DSMD
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@1 | 68.4 | DSMD
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@10 | 94.4 | DSMD
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@5 | 90.8 | DSMD
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@1 | 48 | DSMD
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@10 | 84.5 | DSMD
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@5 | 75.6 | DSMD
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 62.1 | DSMD
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 92 | DSMD
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 85.9 | DSMD
Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 82.5 | DSMD
Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 97.7 | DSMD
Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 95.5 | DSMD
Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 68.4 | DSMD
Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 94.4 | DSMD
Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 90.8 | DSMD
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 48 | DSMD
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 84.5 | DSMD
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 75.6 | DSMD
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 62.1 | DSMD
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 92 | DSMD
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 85.9 | DSMD

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction (2025-07-15)
Dual Dimensions Geometric Representation Learning Based Document Dewarping (2025-07-11)