Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval

Haochen Han, Qinghua Zheng, Guang Dai, Minnan Luo, Jingdong Wang

Published: 2024-03-08 · CVPR 2024
Tasks: Cross-Modal Retrieval · Cross-modal retrieval with noisy correspondence · Semantic Similarity · Semantic Textual Similarity · Retrieval
Links: Paper · PDF · Code (official)

Abstract

Collecting well-matched multimedia datasets is crucial for training cross-modal retrieval models. However, in real-world scenarios, massive multimodal data are harvested from the Internet and inevitably contain Partially Mismatched Pairs (PMPs). Such semantically irrelevant data undoubtedly harm cross-modal retrieval performance. Previous efforts tend to mitigate this problem by estimating a soft correspondence to down-weight the contribution of PMPs. In this paper, we address the challenge from a new perspective: the potential semantic similarity among unpaired samples makes it possible to excavate useful knowledge from mismatched pairs. To achieve this, we propose L2RM, a general framework based on Optimal Transport (OT) that learns to rematch mismatched pairs. In detail, L2RM generates refined alignments by seeking a minimal-cost transport plan across different modalities. To formalize the rematching idea in OT, we first propose a self-supervised cost function that automatically learns an explicit similarity-to-cost mapping. Second, we model a partial OT problem that restricts transport among false positives to further boost the refined alignments. Extensive experiments on three benchmarks demonstrate that L2RM significantly improves the robustness of existing models against PMPs. The code is available at https://github.com/hhc1997/L2RM.
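
The rematching idea can be sketched with standard entropic OT. Below is a minimal, illustrative Python sketch assuming a naive cost = 1 - similarity mapping, uniform marginals, and classic Sinkhorn iterations. L2RM itself learns the similarity-to-cost mapping self-supervisedly and solves a partial OT problem restricted to detected false positives, so the function names and the cost choice here are assumptions for illustration, not the authors' implementation (see the official code for that).

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    # Entropic OT: find a plan P >= 0 with row sums a and column sums b
    # that minimizes <P, cost> - eps * H(P), via Sinkhorn scaling.
    K = np.exp(-cost / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                     # match row marginals
        v = b / (K.T @ u)                   # match column marginals
    return u[:, None] * K * v[None, :]      # transport plan

def rematch_soft_alignments(sim):
    # Hypothetical rematching step: map similarities to costs
    # (L2RM learns this mapping; 1 - sim is a stand-in) and read the
    # row-normalized transport plan as soft image-text alignments.
    n, m = sim.shape
    cost = 1.0 - sim
    plan = sinkhorn(cost, np.full(n, 1.0 / n), np.full(m, 1.0 / m))
    return plan * n                         # each row now sums to ~1

rng = np.random.default_rng(0)
sim = rng.random((4, 4))                    # toy image-text similarity matrix
print(rematch_soft_alignments(sim).round(2))
```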

Results

Tasks: Cross-Modal Retrieval · Cross-Modal Information Retrieval · Image Retrieval with Multi-Modal Query (identical results reported under each).

| Dataset | Model | Image-to-text R@1 | R@5 | R@10 | Text-to-image R@1 | R@5 | R@10 | R-Sum |
|---|---|---|---|---|---|---|---|---|
| COCO-Noisy | L2RM-SCARF | 80.2 | 96.3 | 98.5 | 64.2 | 90.1 | 95.4 | 524.7 |
| CC152K | L2RM-SGRAF | 43.0 | 67.5 | 75.7 | 42.8 | 68.0 | 77.2 | 374.2 |
| Flickr30K-Noisy | L2RM-SGRAF | 77.9 | 95.2 | 97.8 | 59.8 | 83.6 | 89.5 | 503.8 |
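
The metrics follow the standard retrieval definitions: R@K is the percentage of queries whose ground-truth match is ranked in the top K, and R-Sum is the sum of the six recall values (for COCO-Noisy: 80.2 + 96.3 + 98.5 + 64.2 + 90.1 + 95.4 = 524.7). A small sketch of the computation, assuming one ground-truth item per query on the diagonal of the similarity matrix (the actual benchmarks pair each image with five captions, so this is simplified):

```python
import numpy as np

def recall_at_k(sim, k):
    # sim[i, j]: similarity of query i to item j; ground truth is item i.
    ranking = np.argsort(-sim, axis=1)                   # best match first
    gt = np.arange(sim.shape[0])[:, None]
    return 100.0 * (ranking[:, :k] == gt).any(axis=1).mean()

def r_sum(sim_i2t):
    # Six recalls: R@1/5/10 image-to-text plus R@1/5/10 text-to-image
    # (text-to-image scores come from the transposed similarity matrix).
    return sum(recall_at_k(s, k)
               for s in (sim_i2t, sim_i2t.T)
               for k in (1, 5, 10))

rng = np.random.default_rng(0)
sim = rng.random((100, 100)) + 2.0 * np.eye(100)  # toy scores, strong diagonal
print(round(r_sum(sim), 1))
```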

Related Papers

- SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)