Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation

Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Xiaojun Chang, Jingdong Wang

Published: 2023-12-27
Tasks: Cross-Modal Retrieval, Cross-modal retrieval with noisy correspondence, Retrieval, Memorization
Links: Paper, PDF

Abstract

Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to curate in practice. Recently, to alleviate expensive data collection, co-occurring pairs from the Internet have been automatically harvested for training. However, such data inevitably include mismatched pairs, i.e., noisy correspondences, which undermine supervision reliability and degrade performance. Current methods leverage deep neural networks' memorization effect to address noisy correspondences, but they overconfidently focus on similarity-guided training with hard negatives and suffer from self-reinforcing errors. In light of the above, we introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM). Specifically, by viewing sample matching as a classification task within the batch, we generate classification logits for the given sample. Instead of a single similarity score, we refine sample filtration through energy uncertainty and estimate the model's sensitivity to selected clean samples using swapped classification entropy, in view of the overall prediction distribution. Additionally, we propose cross-modal biased complementary learning to leverage negative matches overlooked in hard-negative training, further improving model optimization stability and curbing self-reinforcing errors. Extensive experiments on challenging benchmarks affirm the efficacy and efficiency of SREM.
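The in-batch classification view and energy-based sample filtration described in the abstract can be sketched as follows. This is a minimal illustration under assumed embedding shapes and an assumed temperature of 0.07, not the authors' released implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def in_batch_logits(img_emb, txt_emb, temperature=0.07):
    """Treat in-batch matching as a B-way classification problem:
    row i holds image i's logits against every caption in the batch."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T / temperature

def energy_score(logits):
    """Free-energy score E(x) = -log sum_j exp(logit_j), computed over the
    whole prediction distribution rather than a single similarity score.
    Confident rows (one dominant logit) get very negative energy; rows with
    no clear match get higher energy and can be filtered as likely noisy."""
    m = logits.max(axis=1, keepdims=True)        # shift for numerical stability
    return -(m.squeeze(1) + np.log(np.exp(logits - m).sum(axis=1)))

rng = np.random.default_rng(0)
B, d = 8, 16
img = rng.normal(size=(B, d))
txt = img + 0.1 * rng.normal(size=(B, d))        # mostly well-matched pairs
txt[0] = rng.normal(size=d)                      # simulate one noisy correspondence
logits = in_batch_logits(img, txt)
energy = energy_score(logits)
# the mismatched pair tends to receive the highest (least confident) energy
noisy_candidates = np.argsort(energy)[::-1]
print(noisy_candidates[0])
```

Rows whose energy exceeds a chosen threshold would then be excluded from the clean training set; the paper additionally weights clean samples via swapped classification entropy, which this sketch omits.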

Results

Identical SREM results are reported under three task leaderboards: Image Retrieval with Multi-Modal Query, Cross-Modal Information Retrieval, and Cross-Modal Retrieval.

Dataset | Metric | Value | Model
COCO-Noisy | Image-to-text R@1 | 78.5 | SREM
COCO-Noisy | Image-to-text R@5 | 96.8 | SREM
COCO-Noisy | Image-to-text R@10 | 98.8 | SREM
COCO-Noisy | Text-to-image R@1 | 63.8 | SREM
COCO-Noisy | Text-to-image R@5 | 90.4 | SREM
COCO-Noisy | Text-to-image R@10 | 95.8 | SREM
COCO-Noisy | R-Sum | 524.1 | SREM
CC152K | Image-to-text R@1 | 40.9 | SREM
CC152K | Image-to-text R@5 | 67.5 | SREM
CC152K | Image-to-text R@10 | 77.1 | SREM
CC152K | Text-to-image R@1 | 41.5 | SREM
CC152K | Text-to-image R@5 | 68.2 | SREM
CC152K | Text-to-image R@10 | 77.0 | SREM
CC152K | R-Sum | 372.2 | SREM
Flickr30K-Noisy | Image-to-text R@1 | 79.5 | SREM
Flickr30K-Noisy | Image-to-text R@5 | 94.2 | SREM
Flickr30K-Noisy | Image-to-text R@10 | 97.9 | SREM
Flickr30K-Noisy | Text-to-image R@1 | 61.2 | SREM
Flickr30K-Noisy | Text-to-image R@5 | 84.8 | SREM
Flickr30K-Noisy | Text-to-image R@10 | 90.2 | SREM
Flickr30K-Noisy | R-Sum | 507.8 | SREM
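R-Sum in the results above is simply the sum of the six recall values (R@1/R@5/R@10 in both retrieval directions). A quick check against the COCO-Noisy row:

```python
# R-Sum aggregates the six recall metrics reported in the table above.
i2t = [78.5, 96.8, 98.8]   # Image-to-text R@1 / R@5 / R@10 on COCO-Noisy
t2i = [63.8, 90.4, 95.8]   # Text-to-image R@1 / R@5 / R@10 on COCO-Noisy
r_sum = sum(i2t) + sum(t2i)
print(round(r_sum, 1))     # 524.1, matching the reported R-Sum
```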

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
- Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)