Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Plug-and-Play Regulators for Image-Text Matching

Haiwen Diao, Ying Zhang, Wei Liu, Xiang Ruan, Huchuan Lu

2023-03-23 · Cross-Modal Retrieval · Image-Text Matching · Text Matching · Image Retrieval

Paper · PDF · Code (official)

Abstract

Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching. Generally, recent approaches first employ a cross-modal attention unit to capture latent region-word interactions, and then integrate all the alignments to obtain the final similarity. However, most of them adopt one-time forward association or aggregation strategies with complex architectures or additional information, while ignoring the regulation ability of network feedback. In this paper, we develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations. Specifically, we propose (i) a Recurrent Correspondence Regulator (RCR) which facilitates the cross-modal attention unit progressively with adaptive attention factors to capture more flexible correspondence, and (ii) a Recurrent Aggregation Regulator (RAR) which adjusts the aggregation weights repeatedly to increasingly emphasize important alignments and dilute unimportant ones. Besides, it is interesting that RCR and RAR are plug-and-play: both of them can be incorporated into many frameworks based on cross-modal interaction to obtain significant benefits, and their cooperation achieves further improvements. Extensive experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models, confirming the general effectiveness and generalization ability of the proposed methods. Code and pre-trained models are available at: https://github.com/Paranioar/RCAR.
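The RAR idea in the abstract, repeatedly adjusting aggregation weights to emphasize important alignments and dilute unimportant ones, can be illustrated with a minimal numeric sketch. This is a hypothetical simplification, not the paper's implementation: `recurrent_aggregation` and its update rule (boost alignments above the current aggregate via a softmax) are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recurrent_aggregation(alignments, steps=3):
    """Toy sketch of a recurrent aggregation regulator: repeatedly
    re-weight per-alignment scores so stronger alignments dominate
    the final similarity. (Illustrative only, not the paper's RAR.)"""
    weights = np.full_like(alignments, 1.0 / len(alignments))  # start uniform
    for _ in range(steps):
        sim = float(np.dot(weights, alignments))   # current aggregate similarity
        # Alignments above the current aggregate get boosted; others diluted.
        weights = softmax(alignments - sim)
    return float(np.dot(weights, alignments))

scores = np.array([0.9, 0.1, 0.2, 0.8])
print(recurrent_aggregation(scores))  # above the plain mean of 0.5
```

Each round the aggregate drifts toward the high-scoring alignments, which is the qualitative behavior the abstract describes for RAR.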

Results

Task | Dataset | Metric | Value | Model
Image Retrieval | Flickr30K 1K test | R@1 | 62.6 | RCAR
Image Retrieval | Flickr30K 1K test | R@5 | 85.8 | RCAR
Image Retrieval | Flickr30K 1K test | R@10 | 91.1 | RCAR
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 82.3 | RCAR
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 96.0 | RCAR
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 98.4 | RCAR
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 62.6 | RCAR
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 85.8 | RCAR
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 91.1 | RCAR
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 61.3 | RCAR
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 86.1 | RCAR
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 92.6 | RCAR
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 44.3 | RCAR
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 73.2 | RCAR
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 83.2 | RCAR
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@1 | 82.3 | RCAR
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@5 | 96.0 | RCAR
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@10 | 98.4 | RCAR
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@1 | 62.6 | RCAR
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@5 | 85.8 | RCAR
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@10 | 91.1 | RCAR
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@1 | 61.3 | RCAR
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@5 | 86.1 | RCAR
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@10 | 92.6 | RCAR
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 44.3 | RCAR
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 73.2 | RCAR
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 83.2 | RCAR
Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 82.3 | RCAR
Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 96.0 | RCAR
Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 98.4 | RCAR
Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 62.6 | RCAR
Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 85.8 | RCAR
Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 91.1 | RCAR
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 61.3 | RCAR
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 86.1 | RCAR
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 92.6 | RCAR
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 44.3 | RCAR
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 73.2 | RCAR
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 83.2 | RCAR
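The R@K (Recall at K) metric reported above measures the fraction of queries whose ground-truth match appears among the top-K retrieved items. A minimal sketch, assuming the ground-truth item for query i sits at index i of the similarity matrix (the standard convention for paired image-caption test sets):

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K: fraction of queries whose ground-truth item
    (assumed at the query's own index) ranks in the top K.

    sim: (n_queries, n_items) similarity matrix."""
    ranks = np.argsort(-sim, axis=1)  # items per query, best first
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy 3x3 similarity matrix: queries 0 and 2 retrieve their match first.
sim = np.array([[0.9, 0.1, 0.0],
                [0.8, 0.2, 0.1],
                [0.0, 0.3, 0.7]])
print(recall_at_k(sim, 1))  # 2 of 3 queries hit at rank 1
```

In the leaderboard above, "Image-to-text" scores rank captions for each image query, and "Text-to-image" scores rank images for each caption query; R@1 ≤ R@5 ≤ R@10 by construction.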

Related Papers

FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features (2025-07-11)
MS-DPPs: Multi-Source Determinantal Point Processes for Contextual Diversity Refinement of Composite Attributes in Text to Image Retrieval (2025-07-09)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval (2025-07-08)
An analysis of vision-language models for fabric retrieval (2025-07-07)
Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model (2025-07-07)