Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Similarity Reasoning and Filtration for Image-Text Matching

Haiwen Diao, Ying Zhang, Lin Ma, Huchuan Lu

2021-01-05
Tasks: Cross-Modal Retrieval, Image-text matching, Text Matching, Sentence Retrieval, Image Retrieval

Abstract

Image-text matching plays a critical role in bridging vision and language, and great progress has been made by exploiting the global alignment between image and sentence, or local alignments between regions and words. However, how to make the most of these alignments to infer more accurate matching scores is still underexplored. In this paper, we propose a novel Similarity Graph Reasoning and Attention Filtration (SGRAF) network for image-text matching. Specifically, vector-based similarity representations are first learned to characterize the local and global alignments in a more comprehensive manner, and then the Similarity Graph Reasoning (SGR) module, which relies on a single graph convolutional neural network, is introduced to infer relation-aware similarities from both the local and global alignments. The Similarity Attention Filtration (SAF) module is further developed to integrate these alignments effectively by selectively attending to the significant and representative alignments while casting aside the interference of non-meaningful alignments. We demonstrate the superiority of the proposed method by achieving state-of-the-art performance on the Flickr30K and MSCOCO datasets, and show the good interpretability of the SGR and SAF modules with extensive qualitative experiments and analyses.
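The abstract describes SAF as attending to significant alignments and suppressing non-meaningful ones. The following is a minimal sketch of that general idea only, not the authors' implementation: each alignment's similarity vector gets a sigmoid gate from a linear projection, the gates are normalised to sum to one, and the vectors are fused by weighted sum. The function name `attention_filter`, the projection vector `w`, and the choice of a sigmoid gate with L1 normalisation are all assumptions made for illustration.

```python
import math


def attention_filter(sim_vectors, w):
    """Fuse alignment similarity vectors with gated, normalised weights.

    sim_vectors: list of equal-length lists, one per local or global alignment.
    w: hypothetical learned projection vector, same length as each similarity vector.
    Returns (weights, fused_vector).
    """
    # Gate each alignment: sigmoid of a linear projection of its similarity vector.
    gates = [
        1.0 / (1.0 + math.exp(-sum(wi * si for wi, si in zip(w, vec))))
        for vec in sim_vectors
    ]
    # L1-normalise the gates so the attention weights sum to 1.
    total = sum(gates)
    weights = [g / total for g in gates]
    # Weighted sum of the similarity vectors, dimension by dimension.
    dim = len(sim_vectors[0])
    fused = [
        sum(wt * vec[d] for wt, vec in zip(weights, sim_vectors))
        for d in range(dim)
    ]
    return weights, fused


# Toy usage: the first alignment projects to a higher score, so it gets more weight.
weights, fused = attention_filter([[2.0, 0.0], [-2.0, 0.0]], w=[1.0, 0.0])
```

In this toy setup an alignment whose projected score is low contributes little to the fused vector, which is the filtering behaviour the abstract refers to; a trained model would learn `w` rather than fix it by hand.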

Results

Task                                    Dataset            Metric              Value  Model
Image Retrieval                         Flickr30K 1K test  R@1                 58.5   SGRAF
Image Retrieval                         Flickr30K 1K test  R@5                 83     SGRAF
Image Retrieval                         Flickr30K 1K test  R@10                88.8   SGRAF
Image Retrieval with Multi-Modal Query  Flickr30k          Image-to-text R@1   77.8   SGRAF
Image Retrieval with Multi-Modal Query  Flickr30k          Image-to-text R@5   94.1   SGRAF
Image Retrieval with Multi-Modal Query  Flickr30k          Image-to-text R@10  97.4   SGRAF
Image Retrieval with Multi-Modal Query  Flickr30k          Text-to-image R@1   58.5   SGRAF
Image Retrieval with Multi-Modal Query  Flickr30k          Text-to-image R@5   83     SGRAF
Image Retrieval with Multi-Modal Query  Flickr30k          Text-to-image R@10  88.8   SGRAF
Image Retrieval with Multi-Modal Query  COCO 2014          Image-to-text R@1   57.8   SGRAF
Image Retrieval with Multi-Modal Query  COCO 2014          Image-to-text R@5   84.9   SGRAF
Image Retrieval with Multi-Modal Query  COCO 2014          Image-to-text R@10  91.6   SGRAF
Image Retrieval with Multi-Modal Query  COCO 2014          Text-to-image R@1   41.9   SGRAF
Image Retrieval with Multi-Modal Query  COCO 2014          Text-to-image R@5   70.7   SGRAF
Image Retrieval with Multi-Modal Query  COCO 2014          Text-to-image R@10  81.3   SGRAF
Cross-Modal Information Retrieval       Flickr30k          Image-to-text R@1   77.8   SGRAF
Cross-Modal Information Retrieval       Flickr30k          Image-to-text R@5   94.1   SGRAF
Cross-Modal Information Retrieval       Flickr30k          Image-to-text R@10  97.4   SGRAF
Cross-Modal Information Retrieval       Flickr30k          Text-to-image R@1   58.5   SGRAF
Cross-Modal Information Retrieval       Flickr30k          Text-to-image R@5   83     SGRAF
Cross-Modal Information Retrieval       Flickr30k          Text-to-image R@10  88.8   SGRAF
Cross-Modal Information Retrieval       COCO 2014          Image-to-text R@1   57.8   SGRAF
Cross-Modal Information Retrieval       COCO 2014          Image-to-text R@5   84.9   SGRAF
Cross-Modal Information Retrieval       COCO 2014          Image-to-text R@10  91.6   SGRAF
Cross-Modal Information Retrieval       COCO 2014          Text-to-image R@1   41.9   SGRAF
Cross-Modal Information Retrieval       COCO 2014          Text-to-image R@5   70.7   SGRAF
Cross-Modal Information Retrieval       COCO 2014          Text-to-image R@10  81.3   SGRAF
Cross-Modal Retrieval                   Flickr30k          Image-to-text R@1   77.8   SGRAF
Cross-Modal Retrieval                   Flickr30k          Image-to-text R@5   94.1   SGRAF
Cross-Modal Retrieval                   Flickr30k          Image-to-text R@10  97.4   SGRAF
Cross-Modal Retrieval                   Flickr30k          Text-to-image R@1   58.5   SGRAF
Cross-Modal Retrieval                   Flickr30k          Text-to-image R@5   83     SGRAF
Cross-Modal Retrieval                   Flickr30k          Text-to-image R@10  88.8   SGRAF
Cross-Modal Retrieval                   COCO 2014          Image-to-text R@1   57.8   SGRAF
Cross-Modal Retrieval                   COCO 2014          Image-to-text R@5   84.9   SGRAF
Cross-Modal Retrieval                   COCO 2014          Image-to-text R@10  91.6   SGRAF
Cross-Modal Retrieval                   COCO 2014          Text-to-image R@1   41.9   SGRAF
Cross-Modal Retrieval                   COCO 2014          Text-to-image R@5   70.7   SGRAF
Cross-Modal Retrieval                   COCO 2014          Text-to-image R@10  81.3   SGRAF
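
The R@K values above are recall-at-K scores. As a rough illustration, assuming the usual retrieval definition (the percentage of queries for which at least one correct item appears among the top K ranked results), the metric could be computed as sketched below. The function name `recall_at_k` and the toy data are hypothetical and are not taken from the paper's evaluation code.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_ids: one ranked list of candidate ids per query (best match first).
    relevant_ids: one set of ground-truth ids per query.
    """
    hits = 0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        # A query counts as a hit if any of its top-k candidates is relevant.
        if any(item in relevant for item in ranking[:k]):
            hits += 1
    return 100.0 * hits / len(ranked_ids)


# Toy usage with three queries: the first is matched at rank 1,
# the second at rank 3, and the third is never matched.
rankings = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
ground_truth = [{"a"}, {"f"}, {"z"}]
r_at_1 = recall_at_k(rankings, ground_truth, 1)
r_at_3 = recall_at_k(rankings, ground_truth, 3)
```

Under this definition R@K can only grow as K increases, which matches the pattern in the table, where R@1 is always the lowest value and R@10 the highest for each dataset and direction.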

Related Papers

FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features (2025-07-11)
MS-DPPs: Multi-Source Determinantal Point Processes for Contextual Diversity Refinement of Composite Attributes in Text to Image Retrieval (2025-07-09)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval (2025-07-08)
An analysis of vision-language models for fabric retrieval (2025-07-07)
Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model (2025-07-07)