

ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning

Quanxing Zha, Xin Liu, Shu-Juan Peng, Yiu-ming Cheung, Xing Xu, Nannan Wang

Published: 2025-02-27 · CVPR 2025
Tasks: Cross-Modal Retrieval · Cross-Modal Retrieval with Noisy Correspondence · Image-Text Retrieval · Image-Text Matching
Links: Paper · PDF · Code (official)

Abstract

Can we accurately identify the true correspondences in multimodal datasets containing mismatched data pairs? Existing methods primarily emphasize similarity matching between object representations across modalities, potentially neglecting the relation consistency within each modality that is particularly important for distinguishing true from false correspondences. Such an omission often risks misidentifying negatives as positives, leading to unanticipated performance degradation. To address this problem, we propose a general Relation Consistency learning framework, namely ReCon, to accurately discriminate the true correspondences among multimodal data and thus effectively mitigate the adverse impact of mismatches. Specifically, ReCon leverages a novel relation consistency learning scheme to ensure dual alignment: cross-modal relation consistency between modalities and intra-modal relation consistency within each modality. Thanks to these dual constraints on relations, ReCon significantly enhances true correspondence discrimination and therefore reliably filters out mismatched pairs, mitigating the risk of erroneous supervision. Extensive experiments on three widely used benchmark datasets, Flickr30K, MS-COCO, and Conceptual Captions, demonstrate the effectiveness and superiority of ReCon over other state-of-the-art methods. The code is available at: https://github.com/qxzha/ReCon.
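To make the dual relation-consistency idea concrete, here is a minimal PyTorch sketch of how such consistency scores might be computed over a batch of paired embeddings and used to filter likely mismatches. The function name, the specific relation measures, the equal weighting of the two terms, and the 0.5 threshold are all illustrative assumptions, not the paper's method; the authors' actual implementation is in the repository linked above.

```python
# A minimal sketch of the relation-consistency idea from the abstract, NOT the
# authors' implementation (see https://github.com/qxzha/ReCon for the official
# code). Shapes, names, weighting, and the threshold rule are assumptions.
import torch
import torch.nn.functional as F

def relation_consistency_scores(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Score each (image, text) pair in a batch by how consistent its
    relations to the other samples are across and within modalities.

    img_emb, txt_emb: (B, D) embeddings for B paired samples.
    Returns: (B,) scores in [-1, 1]; higher = more likely a true correspondence.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)

    # Intra-modal relation matrices: how each sample relates to the rest of
    # its own modality (cosine similarity to every other batch item).
    rel_img = img @ img.t()          # (B, B)
    rel_txt = txt @ txt.t()          # (B, B)

    # Intra-modal relation consistency: a true pair should occupy a similar
    # "position" relative to the batch in both modalities, so its row of
    # relations should agree across the two matrices.
    intra = F.cosine_similarity(rel_img, rel_txt, dim=-1)            # (B,)

    # Cross-modal relation consistency: compare each pair's relations to the
    # opposite modality as seen from the image side and from the text side.
    cross_i2t = img @ txt.t()                                        # (B, B)
    cross = F.cosine_similarity(cross_i2t, cross_i2t.t(), dim=-1)    # (B,)

    # Combine the two consistencies (equal weighting is an assumption).
    return 0.5 * (intra + cross)

if __name__ == "__main__":
    # Filtering noisy correspondences: keep pairs whose score clears a
    # threshold (0.5 is a placeholder; the paper's selection rule differs).
    B, D = 128, 256
    scores = relation_consistency_scores(torch.randn(B, D), torch.randn(B, D))
    clean_mask = scores > 0.5
    print(f"kept {int(clean_mask.sum())}/{B} pairs as likely true correspondences")
```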

Results

The page lists identical numbers under three task leaderboards (Image Retrieval with Multi-Modal Query, Cross-Modal Information Retrieval, and Cross-Modal Retrieval). The deduplicated results for ReCon per dataset are:

| Dataset | I→T R@1 | I→T R@5 | I→T R@10 | T→I R@1 | T→I R@5 | T→I R@10 | R-Sum |
|---|---|---|---|---|---|---|---|
| COCO-Noisy | 80.9 | 96.6 | 98.8 | 65.2 | 91.0 | 96.0 | 528.6 |
| CC152K | 43.1 | 68.7 | 78.1 | 44.9 | 68.3 | 77.4 | 380.5 |
| Flickr30K-Noisy | 80.3 | 95.3 | 97.8 | 61.6 | 85.5 | 91.3 | 511.8 |
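For reference, the metrics above are the standard image-text retrieval measures: Recall@K (the percentage of queries whose true match ranks in the top K) in both directions, plus R-Sum, the sum of the six recall values. Below is a minimal sketch of how they are typically computed from a similarity matrix. Note it assumes a 1:1 pairing on the diagonal; Flickr30K and MS-COCO actually pair each image with five captions, so real evaluation code uses a many-to-one ground-truth map.

```python
# A minimal sketch of Recall@K / R-Sum evaluation for image-text retrieval.
# Assumes query i's true match is gallery item i (a simplification).
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """sim: (N, N) similarity matrix, queries in rows; ground truth is the
    diagonal. Returns {k: recall in percent}."""
    n = sim.shape[0]
    # Rank of the true match = number of gallery items scored strictly higher.
    true_scores = sim[np.arange(n), np.arange(n)]
    ranks = (sim > true_scores[:, None]).sum(axis=1)
    return {k: 100.0 * float((ranks < k).mean()) for k in ks}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim_i2t = rng.standard_normal((1000, 1000))    # image-to-text scores
    i2t = recall_at_k(sim_i2t)                     # image-to-text R@{1,5,10}
    t2i = recall_at_k(sim_i2t.T)                   # text-to-image R@{1,5,10}
    r_sum = sum(i2t.values()) + sum(t2i.values())  # R-Sum, as in the table
    print(i2t, t2i, f"R-Sum={r_sum:.1f}")
```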

Related Papers

An analysis of vision-language models for fabric retrieval (2025-07-07)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval (2025-06-26)
Multimodal Medical Image Binding via Shared Text Embeddings (2025-06-22)
ContextRefine-CLIP for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2025 (2025-06-12)
FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models (2025-06-12)
Adding simple structure at inference improves Vision-Language Compositionality (2025-06-11)
FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation (2025-06-10)