Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Graph Structured Network for Image-Text Matching

Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, Yongdong Zhang

2020-04-01 · CVPR 2020
Tags: Cross-Modal Retrieval, Image-Text Matching, Attribute, Text Matching
Links: Paper · PDF · Code (official)

Abstract

Image-text matching has received growing interest since it bridges vision and language. The key challenge lies in how to learn correspondence between image and text. Existing works learn coarse correspondence based on object co-occurrence statistics, but fail to learn fine-grained phrase correspondence. In this paper, we present a novel Graph Structured Matching Network (GSMN) to learn fine-grained correspondence. The GSMN explicitly models object, relation and attribute as a structured phrase, which not only allows learning the correspondence of object, relation and attribute separately, but also helps learn the fine-grained correspondence of structured phrases. This is achieved by node-level matching and structure-level matching. The node-level matching associates each node with its relevant nodes from another modality, where a node can be an object, relation or attribute. The associated nodes then jointly infer fine-grained correspondence by fusing neighborhood associations at structure-level matching. Comprehensive experiments show that GSMN outperforms state-of-the-art methods on benchmarks, with relative Recall@1 improvements of nearly 7% and 2% on Flickr30K and MSCOCO, respectively. Code will be released at: https://github.com/CrossmodalGroup/GSMN.
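The two matching stages described in the abstract can be illustrated schematically: node-level matching attends each text node (object, relation or attribute) over the image nodes, and structure-level matching fuses each node's association with those of its graph neighbors. The sketch below is a simplified illustration under assumed tensor shapes, not the authors' official GSMN implementation (see the linked repository for that); the function names and the mean-aggregation rule are hypothetical.

```python
import numpy as np

def node_level_matching(text_nodes, image_nodes, temperature=10.0):
    """Associate each text node with relevant image nodes via attention
    over cosine similarities (schematic sketch, not the official GSMN code).

    text_nodes:  (n_text, d) array of text graph node features
    image_nodes: (n_image, d) array of image graph node features
    Returns an attended image representation per text node, shape (n_text, d).
    """
    t = text_nodes / np.linalg.norm(text_nodes, axis=1, keepdims=True)
    v = image_nodes / np.linalg.norm(image_nodes, axis=1, keepdims=True)
    sim = t @ v.T                             # (n_text, n_image) cosine similarities
    attn = np.exp(temperature * sim)
    attn /= attn.sum(axis=1, keepdims=True)   # attention weights over image nodes
    return attn @ image_nodes                 # neighborhood-weighted associations

def structure_level_matching(node_scores, adjacency):
    """Fuse each node's matching score with its graph neighbors' scores
    (hypothetical mean aggregation standing in for structure-level matching).

    node_scores: (n, k) per-node matching features
    adjacency:   (n, n) graph adjacency matrix of the phrase graph
    """
    deg = adjacency.sum(axis=1, keepdims=True) + 1e-8  # avoid divide-by-zero
    return (adjacency @ node_scores) / deg             # average over neighbors
```

In this sketch, fine-grained phrase correspondence emerges because a relation node's final score depends on whether its neighboring object and attribute nodes also matched well, not on the relation node alone.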

Results

Task                                     | Dataset   | Metric            | Value | Model
Image Retrieval                          | Flickr30k | Recall@10         | 89    | GSMN
Image Retrieval                          | Flickr30k | Recall@5          | 82.3  | GSMN
Image Retrieval with Multi-Modal Query   | Flickr30k | Text-to-image R@1 | 57.4  | GSMN
Image Retrieval with Multi-Modal Query   | Flickr30k | Text-to-image R@10| 89    | GSMN
Image Retrieval with Multi-Modal Query   | Flickr30k | Text-to-image R@5 | 82.3  | GSMN
Cross-Modal Information Retrieval        | Flickr30k | Text-to-image R@1 | 57.4  | GSMN
Cross-Modal Information Retrieval        | Flickr30k | Text-to-image R@10| 89    | GSMN
Cross-Modal Information Retrieval        | Flickr30k | Text-to-image R@5 | 82.3  | GSMN
Cross-Modal Retrieval                    | Flickr30k | Text-to-image R@1 | 57.4  | GSMN
Cross-Modal Retrieval                    | Flickr30k | Text-to-image R@10| 89    | GSMN
Cross-Modal Retrieval                    | Flickr30k | Text-to-image R@5 | 82.3  | GSMN
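The Recall@K values reported above measure the fraction of queries whose ground-truth item appears among the top-K retrieved results. A generic sketch of the metric, assuming query i's ground-truth item sits at index i of the gallery (the standard convention for paired image-text benchmarks, not a detail taken from this page):

```python
import numpy as np

def recall_at_k(similarity, k):
    """Text-to-image Recall@K as a percentage.

    similarity: (n_queries, n_gallery) score matrix; the ground-truth item
    for query i is assumed to be gallery item i (hypothetical convention).
    """
    ranks = np.argsort(-similarity, axis=1)  # gallery indices, best match first
    topk = ranks[:, :k]                      # top-k retrieved items per query
    hits = (topk == np.arange(similarity.shape[0])[:, None]).any(axis=1)
    return 100.0 * hits.mean()
```

With this convention, the paper's "relative Recall@1 improvement of nearly 7%" means the new R@1 divided by the previous state-of-the-art R@1 is about 1.07, not a 7-point absolute gain.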

Related Papers

MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Non-Adaptive Adversarial Face Generation (2025-07-16)
Attributes Shape the Embedding Space of Face Recognition Models (2025-07-15)
COLIBRI Fuzzy Model: Color Linguistic-Based Representation and Interpretation (2025-07-15)
Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models (2025-07-13)
Model Parallelism With Subnetwork Data Parallelism (2025-07-11)
Bradley-Terry and Multi-Objective Reward Modeling Are Complementary (2025-07-10)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)