Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation

Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Xiaojun Chang, Jingdong Wang

Published: 2023-12-27
Tasks: Cross-Modal Retrieval, Cross-modal retrieval with noisy correspondence, Retrieval, Memorization
Links: Paper, PDF

Abstract

Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to curate in practice. Recently, to alleviate expensive data collection, co-occurring pairs from the Internet have been automatically harvested for training. However, such data inevitably include mismatched pairs, i.e., noisy correspondences, which undermine supervision reliability and degrade performance. Current methods leverage deep neural networks' memorization effect to address noisy correspondences, but they overconfidently focus on similarity-guided training with hard negatives and suffer from self-reinforcing errors. In light of the above, we introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM). Specifically, by viewing sample matching as a classification task within the batch, we generate classification logits for the given sample. Instead of a single similarity score, we refine sample filtration through energy uncertainty and estimate the model's sensitivity to selected clean samples using swapped classification entropy, in view of the overall prediction distribution. Additionally, we propose cross-modal biased complementary learning to leverage negative matches overlooked in hard-negative training, further improving model optimization stability and curbing self-reinforcing errors. Extensive experiments on challenging benchmarks affirm the efficacy and efficiency of SREM.
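The in-batch classification view and energy-based sample filtration described in the abstract can be sketched as follows. This is a minimal illustration under assumed embedding shapes and an assumed temperature of 0.07, not the authors' released implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def in_batch_logits(img_emb, txt_emb, temperature=0.07):
    """Treat in-batch matching as a B-way classification problem:
    row i holds image i's logits against every caption in the batch."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T / temperature

def energy_score(logits):
    """Free-energy score E(x) = -log sum_j exp(logit_j), computed over the
    whole prediction distribution rather than a single similarity score.
    Confident rows (one dominant logit) get very negative energy; rows with
    no clear match get higher energy and can be filtered as likely noisy."""
    m = logits.max(axis=1, keepdims=True)        # shift for numerical stability
    return -(m.squeeze(1) + np.log(np.exp(logits - m).sum(axis=1)))

rng = np.random.default_rng(0)
B, d = 8, 16
img = rng.normal(size=(B, d))
txt = img + 0.1 * rng.normal(size=(B, d))        # mostly well-matched pairs
txt[0] = rng.normal(size=d)                      # simulate one noisy correspondence
logits = in_batch_logits(img, txt)
energy = energy_score(logits)
# the mismatched pair tends to receive the highest (least confident) energy
noisy_candidates = np.argsort(energy)[::-1]
print(noisy_candidates[0])
```

Rows whose energy exceeds a chosen threshold would then be excluded from the clean training set; the paper additionally weights clean samples via swapped classification entropy, which this sketch omits.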

Results

Identical SREM results are reported under three task leaderboards: Image Retrieval with Multi-Modal Query, Cross-Modal Information Retrieval, and Cross-Modal Retrieval.

Dataset | Metric | Value | Model
COCO-Noisy | Image-to-text R@1 | 78.5 | SREM
COCO-Noisy | Image-to-text R@5 | 96.8 | SREM
COCO-Noisy | Image-to-text R@10 | 98.8 | SREM
COCO-Noisy | Text-to-image R@1 | 63.8 | SREM
COCO-Noisy | Text-to-image R@5 | 90.4 | SREM
COCO-Noisy | Text-to-image R@10 | 95.8 | SREM
COCO-Noisy | R-Sum | 524.1 | SREM
CC152K | Image-to-text R@1 | 40.9 | SREM
CC152K | Image-to-text R@5 | 67.5 | SREM
CC152K | Image-to-text R@10 | 77.1 | SREM
CC152K | Text-to-image R@1 | 41.5 | SREM
CC152K | Text-to-image R@5 | 68.2 | SREM
CC152K | Text-to-image R@10 | 77.0 | SREM
CC152K | R-Sum | 372.2 | SREM
Flickr30K-Noisy | Image-to-text R@1 | 79.5 | SREM
Flickr30K-Noisy | Image-to-text R@5 | 94.2 | SREM
Flickr30K-Noisy | Image-to-text R@10 | 97.9 | SREM
Flickr30K-Noisy | Text-to-image R@1 | 61.2 | SREM
Flickr30K-Noisy | Text-to-image R@5 | 84.8 | SREM
Flickr30K-Noisy | Text-to-image R@10 | 90.2 | SREM
Flickr30K-Noisy | R-Sum | 507.8 | SREM
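R-Sum in the results above is simply the sum of the six recall values (R@1/R@5/R@10 in both retrieval directions). A quick check against the COCO-Noisy row:

```python
# R-Sum aggregates the six recall metrics reported in the table above.
i2t = [78.5, 96.8, 98.8]   # Image-to-text R@1 / R@5 / R@10 on COCO-Noisy
t2i = [63.8, 90.4, 95.8]   # Text-to-image R@1 / R@5 / R@10 on COCO-Noisy
r_sum = sum(i2t) + sum(t2i)
print(round(r_sum, 1))     # 524.1, matching the reported R-Sum
```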

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
- Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)