Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Cross-modal Active Complementary Learning with Self-refining Correspondence

Yang Qin, Yuan Sun, Dezhong Peng, Joey Tianyi Zhou, Xi Peng, Peng Hu

Published: 2023-10-26 · NeurIPS 2023
Tasks: Cross-modal retrieval with noisy correspondence · Image-text matching · Text Matching
Links: Paper · PDF · Code (official)

Abstract

Recently, image-text matching has attracted increasing attention from academia and industry, as it is fundamental to understanding the latent correspondence across visual and textual modalities. However, most existing methods implicitly assume the training pairs are well-aligned while ignoring the ubiquitous annotation noise, a.k.a. noisy correspondence (NC), thereby inevitably leading to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems: excessive memorizing/overfitting and unreliable correction for NC, especially under high noise. To address these two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which benefits from a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, leading to theoretically and experimentally demonstrated robustness against NC. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. We carry out extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of our CRCL against synthetic and real-world noisy correspondences.
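The two components described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `momentum_correct`, `complementary_loss`, and `clean_conf` are hypothetical names, and the loss below is a simplified binary stand-in for the paper's ACL, meant only to show the active-vs-complementary idea and the EMA-style label correction.

```python
import numpy as np

def momentum_correct(old_label, new_estimate, m=0.9):
    """Self-refining correction sketch: smooth a soft correspondence label
    with an exponential moving average, so no single noisy estimate can
    flip the label outright (mitigating error accumulation)."""
    return m * old_label + (1.0 - m) * new_estimate

def complementary_loss(sim, clean_conf, eps=1e-8):
    """Toy active/complementary-style loss on a similarity matrix `sim`
    (rows: images, columns: texts; diagonal pairs are the annotated matches).
    `clean_conf[i]` is the current belief that pair i is correctly annotated.
    Confident pairs are pulled together (active term); doubtful pairs are
    treated as complementary evidence ("this pair does NOT match")."""
    probs = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)  # row-wise softmax
    pos = np.diag(probs)  # predicted matching probability of each annotated pair
    active = -clean_conf * np.log(pos + eps)
    complementary = -(1.0 - clean_conf) * np.log(1.0 - pos + eps)
    return float(np.mean(active + complementary))
```

With a well-separated similarity matrix, this loss is small when correspondences are believed clean and large when they are believed noisy, which is the behavior a correction loop of this kind exploits.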

Results

Model: CRCL. Identical results are reported under three task leaderboards: Image Retrieval with Multi-Modal Query, Cross-Modal Information Retrieval, and Cross-Modal Retrieval.

Dataset         | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | R-Sum
COCO-Noisy      | 79.6    | 96.1    | 98.7     | 64.7    | 90.6    | 95.9     | 525.6
CC152K          | 41.8    | 67.4    | 76.5     | 41.6    | 68.0    | 78.4     | 373.7
Flickr30K-Noisy | 77.9    | 95.4    | 98.3     | 60.9    | 84.7    | 90.6     | 507.8

(I2T = image-to-text retrieval, T2I = text-to-image retrieval.)
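R-Sum here follows the standard aggregate for these benchmarks: the sum of the six recall scores (image-to-text and text-to-image R@1/R@5/R@10). The reported values can be checked directly:

```python
# R-Sum = sum of the six recall scores (I2T and T2I R@1/R@5/R@10).
recalls = {
    "COCO-Noisy":      [79.6, 96.1, 98.7, 64.7, 90.6, 95.9],
    "CC152K":          [41.8, 67.4, 76.5, 41.6, 68.0, 78.4],
    "Flickr30K-Noisy": [77.9, 95.4, 98.3, 60.9, 84.7, 90.6],
}
r_sum = {name: round(sum(vals), 1) for name, vals in recalls.items()}
print(r_sum)  # {'COCO-Noisy': 525.6, 'CC152K': 373.7, 'Flickr30K-Noisy': 507.8}
```

All three sums match the R-Sum values reported in the table.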

Related Papers

Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models (2025-06-10)
TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP (2025-05-24)
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis (2025-05-19)
Descriptive Image-Text Matching with Graded Contextual Similarity (2025-05-15)
Compositional Image-Text Matching and Retrieval by Grounding Entities (2025-05-04)
LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation (2025-04-20)
Instruction-augmented Multimodal Alignment for Image-Text and Element Matching (2025-04-16)
Dependency Structure Augmented Contextual Scoping Framework for Multimodal Aspect-Based Sentiment Analysis (2025-04-15)