Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval

Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han

2020-03-08 · CVPR 2020
Tasks: Cross-Modal Retrieval · Image-text Retrieval · Text Retrieval · Retrieval
Paper · PDF · Code (official)

Abstract

Enabling bi-directional retrieval of images and texts is important for understanding the correspondence between vision and language. Existing methods leverage the attention mechanism to explore such correspondence in a fine-grained manner. However, most of them consider all semantics equally and thus align them uniformly, regardless of their diverse complexities. In fact, semantics are diverse (i.e. involving different kinds of semantic concepts), and humans usually follow a latent structure to combine them into understandable languages. It may be difficult to optimally capture such sophisticated correspondences in existing methods. In this paper, to address such a deficiency, we propose an Iterative Matching with Recurrent Attention Memory (IMRAM) method, in which correspondences between images and texts are captured with multiple steps of alignments. Specifically, we introduce an iterative matching scheme to explore such fine-grained correspondence progressively. A memory distillation unit is used to refine alignment knowledge from early steps to later ones. Experimental results on three benchmark datasets, i.e. Flickr8K, Flickr30K, and MS COCO, show that our IMRAM achieves state-of-the-art performance, well demonstrating its effectiveness. Experiments on a practical business advertisement dataset, named \Ads{}, further validate the applicability of our method in practical scenarios.
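The abstract describes the core loop: region and word features are aligned by cross-attention over several steps, with a memory unit refining the query between steps and a similarity score accumulated per step. The sketch below is a minimal conceptual illustration of that loop, not the authors' implementation; the layer sizes, the gated-fusion form of the memory update, and the per-step cosine-similarity score are illustrative assumptions.

```python
# Conceptual sketch (assumptions, not the official IMRAM code): iterative
# cross-attention alignment with a gated "memory" update between steps.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IterativeMatchingSketch(nn.Module):
    def __init__(self, dim=1024, steps=3):
        super().__init__()
        self.steps = steps
        # Toy stand-in for the memory distillation unit: gated fusion of the
        # current query with the attended context from the other modality.
        self.gate = nn.Linear(2 * dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def attend(self, query, context):
        # query:   (n_q, dim) features from one modality (e.g. caption words)
        # context: (n_c, dim) features from the other (e.g. image regions)
        attn = torch.softmax(query @ context.t() / query.size(-1) ** 0.5, dim=-1)
        return attn @ context  # one attended context vector per query item

    def refine(self, query, attended):
        # Gated memory update: keep part of the old query, absorb new evidence.
        x = torch.cat([query, attended], dim=-1)
        g = torch.sigmoid(self.gate(x))
        return g * torch.tanh(self.update(x)) + (1 - g) * query

    def forward(self, regions, words):
        # regions: (n_regions, dim), words: (n_words, dim); both L2-normalized.
        score = 0.0
        query = words
        for _ in range(self.steps):
            attended = self.attend(query, regions)
            # Per-step similarity: mean cosine similarity of words vs. attended regions.
            score = score + F.cosine_similarity(query, attended, dim=-1).mean()
            query = self.refine(query, attended)  # carry refined alignment forward
        return score


if __name__ == "__main__":
    torch.manual_seed(0)
    regions = F.normalize(torch.randn(36, 1024), dim=-1)  # e.g. 36 detected regions
    words = F.normalize(torch.randn(12, 1024), dim=-1)    # e.g. 12 caption tokens
    print(IterativeMatchingSketch()(regions, words).item())
```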

Results

Task | Dataset | Metric | Value | Model
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 74.1 | IMRAM
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 96.6 | IMRAM
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 93 | IMRAM
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 53.9 | IMRAM
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 87.2 | IMRAM
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 79.4 | IMRAM
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 53.7 | IMRAM
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 91 | IMRAM
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 83.2 | IMRAM
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 39.7 | IMRAM
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 79.8 | IMRAM
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 69.1 | IMRAM
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@1 | 74.1 | IMRAM
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@10 | 96.6 | IMRAM
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@5 | 93 | IMRAM
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@1 | 53.9 | IMRAM
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@10 | 87.2 | IMRAM
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@5 | 79.4 | IMRAM
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@1 | 53.7 | IMRAM
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@10 | 91 | IMRAM
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@5 | 83.2 | IMRAM
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 39.7 | IMRAM
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 79.8 | IMRAM
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 69.1 | IMRAM
Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 74.1 | IMRAM
Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 96.6 | IMRAM
Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 93 | IMRAM
Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 53.9 | IMRAM
Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 87.2 | IMRAM
Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 79.4 | IMRAM
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 53.7 | IMRAM
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 91 | IMRAM
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 83.2 | IMRAM
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 39.7 | IMRAM
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 79.8 | IMRAM
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 69.1 | IMRAM
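The values above are Recall@K (R@1/R@5/R@10): the fraction of queries whose correct match appears among the top-K retrieved items. The snippet below is a minimal sketch of how such numbers are typically computed from a query-candidate similarity matrix; it assumes one ground-truth candidate per query, whereas the standard Flickr30k/COCO protocols use five captions per image and average over test folds.

```python
# Sketch of Recall@K from a similarity matrix (illustrative assumptions only).
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    # sim[i, j]: similarity between query i and candidate j; ground truth is j == i.
    ranking = np.argsort(-sim, axis=1)                 # candidates sorted by score
    gt_rank = np.argmax(ranking == np.arange(sim.shape[0])[:, None], axis=1)
    return {f"R@{k}": float(np.mean(gt_rank < k)) * 100 for k in ks}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.standard_normal((1000, 1000))
    sim[np.arange(1000), np.arange(1000)] += 2.0       # make true pairs score higher
    print(recall_at_k(sim))                            # e.g. image-to-text recalls
```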

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)