
Dissecting Deep Metric Learning Losses for Image-Text Retrieval

Hong Xuan, Xi Chen

2022-10-21 · Cross-Modal Retrieval · Image-text Retrieval · Image-text Matching · Text Matching · Metric Learning · Text Retrieval · Retrieval · Language Modelling

Paper · PDF · Code (official) · Code

Abstract

Visual-Semantic Embedding (VSE) is a prevalent approach in image-text retrieval that learns a joint embedding space between the image and language modalities in which semantic similarities are preserved. The triplet loss with hard-negative mining has become the de facto objective for most VSE methods. Inspired by recent progress in deep metric learning (DML) in the image domain, which has given rise to new loss functions that outperform the triplet loss, we revisit the problem of finding better objectives for VSE in image-text matching. Despite some attempts at designing losses based on gradient movement, most DML losses are defined empirically in the embedding space. Instead of directly applying these loss functions, which may lead to sub-optimal gradient updates to model parameters, we present a novel Gradient-based Objective AnaLysis framework, or GOAL, to systematically analyze the combinations and reweightings of the gradients in existing DML functions. With the help of this analysis framework, we further propose a new family of objectives in the gradient space that explores different gradient combinations. When the gradients are not integrable to a valid loss function, we implement our proposed objectives so that they operate directly in the gradient space rather than on losses in the embedding space. Comprehensive experiments demonstrate that our novel objectives consistently improve performance over baselines across different visual/text features and model frameworks. We also show the generalizability of the GOAL framework by extending it to other models using triplet-family losses, including a vision-language model with heavy cross-modal interactions, and achieve state-of-the-art results on the image-text retrieval tasks on COCO and Flickr30K.
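The two ideas the abstract leans on, the hard-negative triplet objective used by most VSE methods and the option of optimizing directly in the gradient space, can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not the authors' implementation: the function triplet_loss_hard_negative, the GradientSpaceObjective class, and the particular reweighting in its backward pass are assumptions for illustration only, meant to show how a custom autograd.Function lets one emit hand-designed gradients even when they do not integrate to a closed-form loss.

```python
# Minimal sketch (not the paper's official code), assuming L2-normalized
# image/text embeddings of shape (B, D) and cosine similarity.
import torch
import torch.nn.functional as F


def triplet_loss_hard_negative(img_emb, txt_emb, margin=0.2):
    """VSE-style triplet loss with in-batch hard-negative mining (sketch)."""
    sims = img_emb @ txt_emb.t()                  # (B, B) similarities; diagonal = matched pairs
    pos = sims.diag()
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)

    # hardest negative caption per image, hardest negative image per caption
    neg_for_img = sims.masked_fill(mask, -1e4).max(dim=1).values
    neg_for_txt = sims.masked_fill(mask, -1e4).max(dim=0).values

    cost_img = F.relu(margin + neg_for_img - pos)
    cost_txt = F.relu(margin + neg_for_txt - pos)
    return (cost_img + cost_txt).mean()


class GradientSpaceObjective(torch.autograd.Function):
    """Hypothetical gradient-space objective: the forward pass returns a dummy
    scalar, and the backward pass emits hand-designed, reweighted gradients on
    the similarity matrix. This is how one can optimize in gradient space when
    the desired gradient field corresponds to no valid loss function."""

    @staticmethod
    def forward(ctx, sims, margin):
        ctx.save_for_backward(sims)
        ctx.margin = margin
        return sims.new_zeros(())                 # value is irrelevant; only gradients matter

    @staticmethod
    def backward(ctx, grad_output):
        (sims,) = ctx.saved_tensors
        B = sims.size(0)
        mask = torch.eye(B, dtype=torch.bool, device=sims.device)
        pos = sims.diag().view(-1, 1)

        # illustrative reweighting: push negatives down in proportion to how
        # badly they violate the margin, pull positives up with uniform weight
        viol = F.relu(ctx.margin + sims - pos).masked_fill(mask, 0.0)
        grad = viol / (viol.sum() + 1e-8)         # gradient w.r.t. negative similarities
        grad = grad - mask.float() / B            # gradient w.r.t. positive similarities
        return grad_output * grad, None


# usage sketch:
#   sims = img_emb @ txt_emb.t()
#   loss = GradientSpaceObjective.apply(sims, 0.2)
#   loss.backward()                               # backpropagates the designed gradients
```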

Results

Task | Dataset | Metric | Value | Model
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 97 | VSE-Gradient
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 100 | VSE-Gradient
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 99.6 | VSE-Gradient
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 86.3 | VSE-Gradient
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 99 | VSE-Gradient
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 97.4 | VSE-Gradient
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 81.4 | VSE-Gradient
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 97.9 | VSE-Gradient
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 95.6 | VSE-Gradient
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 63.6 | VSE-Gradient
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 91.5 | VSE-Gradient
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 86 | VSE-Gradient
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@1 | 97 | VSE-Gradient
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@10 | 100 | VSE-Gradient
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@5 | 99.6 | VSE-Gradient
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@1 | 86.3 | VSE-Gradient
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@10 | 99 | VSE-Gradient
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@5 | 97.4 | VSE-Gradient
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@1 | 81.4 | VSE-Gradient
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@10 | 97.9 | VSE-Gradient
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@5 | 95.6 | VSE-Gradient
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 63.6 | VSE-Gradient
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 91.5 | VSE-Gradient
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 86 | VSE-Gradient
Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 97 | VSE-Gradient
Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 100 | VSE-Gradient
Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 99.6 | VSE-Gradient
Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 86.3 | VSE-Gradient
Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 99 | VSE-Gradient
Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 97.4 | VSE-Gradient
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 81.4 | VSE-Gradient
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 97.9 | VSE-Gradient
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 95.6 | VSE-Gradient
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 63.6 | VSE-Gradient
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 91.5 | VSE-Gradient
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 86 | VSE-Gradient
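For reference, the Value column reports Recall@K in percent: the fraction of queries whose ground-truth match appears among the top K retrieved items. Below is a minimal sketch of the image-to-text direction; the function name recall_at_k and the single-caption-per-image setup are simplifying assumptions (the standard COCO/Flickr30K protocol pairs each image with five captions).

```python
# Sketch of Recall@K for image-to-text retrieval, assuming the ground-truth
# caption for image i sits at column i of the similarity matrix.
import numpy as np


def recall_at_k(sims: np.ndarray, k: int) -> float:
    """sims[i, j] = similarity between image i and caption j."""
    ranks = (-sims).argsort(axis=1)               # captions sorted by decreasing similarity
    gt = np.arange(sims.shape[0])[:, None]        # ground-truth caption index per image
    hits = (ranks[:, :k] == gt).any(axis=1)       # is the ground truth in the top k?
    return 100.0 * hits.mean()                    # reported as a percentage, e.g. R@1 = 81.4
```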

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Unsupervised Ground Metric Learning (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)