Similarity Reasoning and Filtration for Image-Text Matching

Haiwen Diao, Ying Zhang, Lin Ma, Huchuan Lu

2021-01-05

Cross-Modal Retrieval, Image-text matching, Text Matching, Sentence Retrieval, Image Retrieval

Abstract

Image-text matching plays a critical role in bridging vision and language, and great progress has been made by exploiting the global alignment between image and sentence or the local alignments between regions and words. However, how to make the most of these alignments to infer more accurate matching scores remains underexplored. In this paper, we propose a novel Similarity Graph Reasoning and Attention Filtration (SGRAF) network for image-text matching. Specifically, vector-based similarity representations are first learned to characterize the local and global alignments more comprehensively, and then the Similarity Graph Reasoning (SGR) module, which relies on a single graph convolutional neural network, is introduced to infer relation-aware similarities from both the local and global alignments. The Similarity Attention Filtration (SAF) module is further developed to integrate these alignments effectively by selectively attending to the significant and representative alignments while suppressing the interference of non-meaningful ones. We demonstrate the superiority of the proposed method by achieving state-of-the-art performance on the Flickr30K and MSCOCO datasets, and show the good interpretability of the SGR and SAF modules through extensive qualitative experiments and analyses.
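To make the filtration idea more concrete, the following is a minimal sketch of attention-weighted aggregation over vector-valued alignments. It is not the authors' implementation: the squared-difference form of `similarity_vector`, the gate vector `w`, and the toy region and word features are all assumptions chosen for illustration, and the learned layers of SGR and SAF are not reproduced.

```python
import math


def similarity_vector(x, y):
    """Vector-valued similarity between two features.

    Assumed form: element-wise squared difference, L2-normalised.
    """
    diff = [(a - b) ** 2 for a, b in zip(x, y)]
    norm = math.sqrt(sum(d * d for d in diff)) or 1.0  # avoid dividing by zero
    return [d / norm for d in diff]


def attention_filter(sim_vectors, w):
    """Pool alignment vectors using sigmoid gates normalised to sum to 1.

    `w` is a hypothetical gate vector standing in for learned parameters.
    """
    gates = [
        1.0 / (1.0 + math.exp(-sum(wi * si for wi, si in zip(w, s))))
        for s in sim_vectors
    ]
    total = sum(gates)
    weights = [g / total for g in gates]
    dim = len(sim_vectors[0])
    pooled = [
        sum(wt * s[i] for wt, s in zip(weights, sim_vectors))
        for i in range(dim)
    ]
    return weights, pooled


# Toy data: two region features compared against one word feature.
regions = [[1.0, 0.0], [0.0, 1.0]]
word = [1.0, 0.0]
sims = [similarity_vector(r, word) for r in regions]
weights, pooled = attention_filter(sims, w=[1.0, 1.0])
print(weights, pooled)
```

In this sketch each alignment receives a gate in (0, 1), so alignments the gate scores low contribute less to the pooled vector instead of being averaged in uniformly.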

Results

Task	Dataset	Metric	Value	Model
Image Retrieval	Flickr30K 1K test	R@1	58.5	SGRAF
Image Retrieval	Flickr30K 1K test	R@10	88.8	SGRAF
Image Retrieval	Flickr30K 1K test	R@5	83	SGRAF
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@1	77.8	SGRAF
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@10	97.4	SGRAF
Image Retrieval with Multi-Modal Query	Flickr30k	Image-to-text R@5	94.1	SGRAF
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@1	58.5	SGRAF
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@10	88.8	SGRAF
Image Retrieval with Multi-Modal Query	Flickr30k	Text-to-image R@5	83	SGRAF
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@1	57.8	SGRAF
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@10	91.6	SGRAF
Image Retrieval with Multi-Modal Query	COCO 2014	Image-to-text R@5	84.9	SGRAF
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@1	41.9	SGRAF
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@10	81.3	SGRAF
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@5	70.7	SGRAF
Cross-Modal Information Retrieval	Flickr30k	Image-to-text R@1	77.8	SGRAF
Cross-Modal Information Retrieval	Flickr30k	Image-to-text R@10	97.4	SGRAF
Cross-Modal Information Retrieval	Flickr30k	Image-to-text R@5	94.1	SGRAF
Cross-Modal Information Retrieval	Flickr30k	Text-to-image R@1	58.5	SGRAF
Cross-Modal Information Retrieval	Flickr30k	Text-to-image R@10	88.8	SGRAF
Cross-Modal Information Retrieval	Flickr30k	Text-to-image R@5	83	SGRAF
Cross-Modal Information Retrieval	COCO 2014	Image-to-text R@1	57.8	SGRAF
Cross-Modal Information Retrieval	COCO 2014	Image-to-text R@10	91.6	SGRAF
Cross-Modal Information Retrieval	COCO 2014	Image-to-text R@5	84.9	SGRAF
Cross-Modal Information Retrieval	COCO 2014	Text-to-image R@1	41.9	SGRAF
Cross-Modal Information Retrieval	COCO 2014	Text-to-image R@10	81.3	SGRAF
Cross-Modal Information Retrieval	COCO 2014	Text-to-image R@5	70.7	SGRAF
Cross-Modal Retrieval	Flickr30k	Image-to-text R@1	77.8	SGRAF
Cross-Modal Retrieval	Flickr30k	Image-to-text R@10	97.4	SGRAF
Cross-Modal Retrieval	Flickr30k	Image-to-text R@5	94.1	SGRAF
Cross-Modal Retrieval	Flickr30k	Text-to-image R@1	58.5	SGRAF
Cross-Modal Retrieval	Flickr30k	Text-to-image R@10	88.8	SGRAF
Cross-Modal Retrieval	Flickr30k	Text-to-image R@5	83	SGRAF
Cross-Modal Retrieval	COCO 2014	Image-to-text R@1	57.8	SGRAF
Cross-Modal Retrieval	COCO 2014	Image-to-text R@10	91.6	SGRAF
Cross-Modal Retrieval	COCO 2014	Image-to-text R@5	84.9	SGRAF
Cross-Modal Retrieval	COCO 2014	Text-to-image R@1	41.9	SGRAF
Cross-Modal Retrieval	COCO 2014	Text-to-image R@10	81.3	SGRAF
Cross-Modal Retrieval	COCO 2014	Text-to-image R@5	70.7	SGRAF
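The R@K values in the table above are recall-at-K percentages: the share of queries for which at least one correct item appears among the top K retrieved results. The sketch below shows one way to compute this metric; the function name `recall_at_k` and the toy rankings are illustrative assumptions, not taken from the paper or its evaluation code.

```python
def recall_at_k(ranked_ids, relevant_ids, ks=(1, 5, 10)):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_ids:   one ranked list of candidate ids per query (best first)
    relevant_ids: one set of correct ids per query
    """
    if not ranked_ids:
        raise ValueError("at least one query is required")
    hits = {k: 0 for k in ks}
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        for k in ks:
            if any(item in relevant for item in ranking[:k]):
                hits[k] += 1
    n = len(ranked_ids)
    return {k: 100.0 * hits[k] / n for k in ks}


# Toy example with four queries and three candidates each.
rankings = [
    ["a", "b", "c"],  # correct item at rank 1
    ["x", "b", "y"],  # correct item at rank 2
    ["c", "a", "b"],  # correct item at rank 3
    ["d", "e", "f"],  # correct item not retrieved
]
relevant = [{"a"}, {"b"}, {"b"}, {"z"}]
scores = recall_at_k(rankings, relevant, ks=(1, 2, 3))
print(scores)
```

Using a set of relevant ids per query covers both retrieval directions: in image-to-text retrieval an image can have several correct captions, while in text-to-image retrieval each caption has a single correct image.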
