3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting

Xuri Ge, Songpei Xu, Fuhai Chen, Jie Wang, Guoxin Wang, Shan An, Joemon M. Jose

2024-04-26 · Cross-Modal Retrieval · Sentence Retrieval · Retrieval
Paper · PDF · Code (official)

Abstract

In this paper, we propose a novel visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency, and high-generalization image-sentence retrieval. 3SHNet highlights prominent objects and their spatial locations within the visual modality, allowing the integration of visual semantic-spatial interactions while keeping the two modalities independent. This integration combines object regions with the corresponding semantic and position layouts derived from segmentation to enhance the visual representation, and the modality independence guarantees efficiency and generalization. Additionally, 3SHNet utilizes structured contextual visual-scene information from segmentation to provide local (region-based) or global (grid-based) guidance and achieve accurate hybrid-level retrieval. Extensive experiments on the MS-COCO and Flickr30K benchmarks demonstrate the superior performance, inference efficiency, and generalization of 3SHNet compared with contemporary state-of-the-art methods. Specifically, on the larger MS-COCO 5K test set, we achieve 16.3%, 24.8%, and 18.3% rSum improvements over state-of-the-art methods using different image representations, while maintaining optimal retrieval efficiency. Moreover, cross-dataset generalization improves by 18.6%. Data and code are available at https://github.com/XuriGe1995/3SHNet.
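
For intuition, the sketch below illustrates the self-highlighting idea described in the abstract: segmentation-derived semantic and spatial layouts gate (highlight) detected region features, with no cross-modal interaction inside the visual branch. This is a minimal, hypothetical sketch; all module names, feature dimensions, and the specific gating form are assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch of semantic-spatial self-highlighting (assumed design,
# not the official 3SHNet code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticSpatialHighlight(nn.Module):
    """Re-weights region features with segmentation-derived semantic and
    spatial cues, keeping the visual branch independent of the text branch."""
    def __init__(self, region_dim=2048, sem_dim=256, spa_dim=64, embed_dim=1024):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, embed_dim)
        self.sem_proj = nn.Linear(sem_dim, embed_dim)    # semantic layout cues
        self.spa_proj = nn.Linear(spa_dim, embed_dim)    # spatial (position) cues
        self.gate = nn.Linear(embed_dim * 2, embed_dim)  # self-highlighting gate

    def forward(self, regions, sem_layout, spa_layout):
        # regions:    (B, N, region_dim) detected object-region features
        # sem_layout: (B, N, sem_dim)    per-region semantic layout from segmentation
        # spa_layout: (B, N, spa_dim)    per-region spatial layout from segmentation
        v = self.region_proj(regions)
        cue = torch.cat([self.sem_proj(sem_layout), self.spa_proj(spa_layout)], dim=-1)
        g = torch.sigmoid(self.gate(cue))      # per-region highlighting weights
        v = F.normalize(g * v + v, dim=-1)     # residual highlighting, then L2-normalize
        return v.mean(dim=1)                   # pooled visual embedding, (B, embed_dim)

# Toy usage with random tensors (batch of 2 images, 36 regions each).
model = SemanticSpatialHighlight()
v = model(torch.randn(2, 36, 2048), torch.randn(2, 36, 256), torch.randn(2, 36, 64))
print(v.shape)  # torch.Size([2, 1024])
```

Because the gate depends only on visual inputs, image embeddings can be precomputed offline, which is what makes this kind of modality-independent design efficient at retrieval time.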

Results

Identical numbers are reported under three task leaderboards: Image Retrieval with Multi-Modal Query, Cross-Modal Information Retrieval, and Cross-Modal Retrieval. Deduplicated results:

| Dataset   | Metric             | Value | Model  |
|-----------|--------------------|-------|--------|
| Flickr30k | Image-to-text R@1  | 87.1  | 3SHNet |
| Flickr30k | Image-to-text R@5  | 98.2  | 3SHNet |
| Flickr30k | Image-to-text R@10 | 99.2  | 3SHNet |
| Flickr30k | Text-to-image R@1  | 69.5  | 3SHNet |
| Flickr30k | Text-to-image R@5  | 91.0  | 3SHNet |
| Flickr30k | Text-to-image R@10 | 94.7  | 3SHNet |
| MSCOCO    | Image-to-text R@1  | 85.8  | 3SHNet |
| COCO 2014 | Image-to-text R@1  | 67.9  | 3SHNet |
| COCO 2014 | Image-to-text R@5  | 90.5  | 3SHNet |
| COCO 2014 | Image-to-text R@10 | 95.4  | 3SHNet |
| COCO 2014 | Text-to-image R@1  | 50.3  | 3SHNet |
| COCO 2014 | Text-to-image R@5  | 79.3  | 3SHNet |
| COCO 2014 | Text-to-image R@10 | 87.7  | 3SHNet |
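
The rSum gains quoted in the abstract use the standard image-sentence retrieval summary metric: the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions. As a worked sanity check against the table above (derived arithmetic, not a reported number):

```python
# rSum = sum of R@{1,5,10} for image-to-text and text-to-image retrieval.
def rsum(i2t, t2i):
    """i2t, t2i: (R@1, R@5, R@10) recall tuples for each direction."""
    return round(sum(i2t) + sum(t2i), 1)

print(rsum((87.1, 98.2, 99.2), (69.5, 91.0, 94.7)))  # Flickr30k: 539.7
print(rsum((67.9, 90.5, 95.4), (50.3, 79.3, 87.7)))  # COCO 2014: 471.1
```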

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)