Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Noisy Correspondence Learning with Meta Similarity Correction

Haochen Han, Kaiyao Miao, Qinghua Zheng, Minnan Luo

Published 2023-04-13 · CVPR 2023
Tasks: Cross-Modal Retrieval · Cross-modal retrieval with noisy correspondence · Binary Classification · Retrieval
Links: Paper · PDF · Code (official)

Abstract

Despite the success of multimodal learning in cross-modal retrieval tasks, this progress relies on correct correspondences among multimedia data. Collecting such ideal data is expensive and time-consuming, and in practice most widely used datasets are harvested from the Internet and inevitably contain mismatched pairs. Training on such noisy-correspondence data degrades performance, because cross-modal retrieval methods may wrongly enforce mismatched pairs to be similar. To tackle this problem, we propose a Meta Similarity Correction Network (MSCN) that provides reliable similarity scores. We cast a binary classification task as the meta-process, which encourages the MSCN to learn discrimination from positive and negative meta-data. To further alleviate the influence of noise, we design an effective data purification strategy that uses the meta-data as prior knowledge to remove noisy samples. Extensive experiments on Flickr30K, MS-COCO, and Conceptual Captions demonstrate the strengths of our method under both synthetic and real-world noise.
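The paper's actual method involves meta-learning (bi-level optimization) to train the correction network; purely as a rough illustration of the purification idea, the sketch below calibrates a decision threshold from a small clean meta-set of corrected similarity scores and drops training pairs scoring below it. All names and values here are hypothetical, not taken from the paper.

```python
def calibrate_threshold(meta_pos, meta_neg):
    # Crude decision boundary: midpoint between the lowest score among
    # known-matched meta pairs and the highest score among known-mismatched
    # ones (illustrative only; MSCN learns this discrimination instead).
    return (min(meta_pos) + max(meta_neg)) / 2.0

def purify(pairs, scores, threshold):
    # Keep only training pairs whose corrected similarity clears the threshold.
    return [p for p, s in zip(pairs, scores) if s >= threshold]

# Hypothetical corrected similarity scores for the clean meta-data.
meta_pos = [0.82, 0.74, 0.91]   # known-correct image-text pairs score high
meta_neg = [0.21, 0.35, 0.18]   # known-mismatched pairs score low
t = calibrate_threshold(meta_pos, meta_neg)   # midpoint of 0.74 and 0.35

# Hypothetical training pairs and their corrected similarities.
train_pairs  = ["pair_a", "pair_b", "pair_c", "pair_d"]
train_scores = [0.80, 0.30, 0.60, 0.40]
clean = purify(train_pairs, train_scores, t)  # keeps pair_a and pair_c
```

The filtered pairs would then be used for ordinary retrieval training, with the likely mismatches excluded.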

Results

The same MSCN results are reported under three leaderboard tasks (Image Retrieval with Multi-Modal Query, Cross-Modal Information Retrieval, and Cross-Modal Retrieval); they are consolidated below. I2T = image-to-text recall, T2I = text-to-image recall; R-Sum is the sum of all six recall values.

Dataset         | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | R-Sum | Model
COCO-Noisy      | 78.1    | 97.2    | 98.8     | 64.3    | 90.4    | 95.8     | 524.6 | MSCN
CC152K          | 40.1    | 65.7    | 76.6     | 40.6    | 67.4    | 76.3     | 366.7 | MSCN
Flickr30K-Noisy | 77.4    | 94.9    | 97.6     | 59.6    | 83.2    | 89.2     | 501.9 | MSCN
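For reference, these metrics can be computed from an image-text similarity matrix. The sketch below (pure Python, toy 4x4 data) assumes each image i has its ground-truth caption at index i; the benchmarks above use K in {1, 5, 10}, while the toy matrix only has 4 candidates, so K in {1, 2, 3} is used here.

```python
def recall_at_k(sim, k):
    # Fraction of queries (rows) whose ground-truth item (index i for
    # query i) appears among the k highest-similarity candidates.
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)

# Toy image-to-text similarity matrix (rows: images, columns: captions).
sim = [
    [0.9, 0.2, 0.1, 0.3],   # image 0: correct caption ranked 1st
    [0.4, 0.3, 0.8, 0.1],   # image 1: correct caption ranked 3rd
    [0.1, 0.2, 0.7, 0.6],   # image 2: correct caption ranked 1st
    [0.5, 0.6, 0.2, 0.4],   # image 3: correct caption ranked 3rd
]

r1 = recall_at_k(sim, 1)   # -> 50.0 (2 of 4 images rank their caption 1st)

# Text-to-image retrieval scores the transposed matrix, and R-Sum adds
# the recall values in both directions.
sim_t = [list(col) for col in zip(*sim)]
rsum = sum(recall_at_k(m, k) for m in (sim, sim_t) for k in (1, 2, 3))
```

With the paper's K values, the R-Sum column is simply I2T R@1 + R@5 + R@10 plus T2I R@1 + R@5 + R@10, which matches the tabulated totals (e.g. 524.6 for COCO-Noisy).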

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)