Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Dual-Path Convolutional Image-Text Embeddings with Instance Loss

Zhedong Zheng, Liang Zheng, Michael Garratt, Yi Yang, Mingliang Xu, Yi-Dong Shen

2017-11-15 · Cross-Modal Retrieval · Content-Based Image Retrieval · NLP based Person Retrieval · Person Retrieval · Retrieval · Text based Person Retrieval

Paper · PDF · Code · Code (official)

Abstract

Matching images and sentences demands a fine-grained understanding of both modalities. In this paper, we propose a new system that discriminatively embeds images and text into a shared visual-textual space. Most existing works in this field apply a ranking loss to pull positive image/text pairs close and push negative pairs apart. However, directly deploying the ranking loss makes network learning hard, since it must build the inter-modal relationship starting from two heterogeneous features. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on the unsupervised assumption that each image/text group can be viewed as a class, so the network can learn fine-grained structure from every image/text group. Experiments show that the instance loss offers better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. In addition, existing works usually apply off-the-shelf features, i.e., word2vec and fixed visual features. As a minor contribution, this paper therefore constructs an end-to-end dual-path convolutional network to learn the image and text representations. End-to-end learning allows the system to learn directly from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields accuracy competitive with state-of-the-art methods. Moreover, in language-based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
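The core idea of the instance loss is that each image/text group is treated as its own class, and both modalities are passed through a shared classifier so their embeddings converge on the same per-instance proxy. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation: the function name, the shared weight matrix `W`, and the plain softmax cross-entropy are all assumptions made for exposition.

```python
import numpy as np

def instance_loss(img_emb, txt_emb, W, instance_ids):
    """Instance-loss sketch (hypothetical, for illustration).

    Each image/text group is treated as one class; a SHARED linear
    classifier W (embed_dim x num_instances) scores both modalities,
    so image and text embeddings of the same instance are pulled
    toward a common class proxy.
    """
    def softmax_cross_entropy(emb):
        logits = emb @ W                                  # (batch, num_instances)
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        # Negative log-likelihood of each embedding's own instance ID.
        return -log_probs[np.arange(len(emb)), instance_ids].mean()

    # Sum the classification losses of the two paths.
    return softmax_cross_entropy(img_emb) + softmax_cross_entropy(txt_emb)
```

In the paper's two-stage training, a loss of this form serves as initialization before (or alongside) the ranking loss; sharing the classifier across modalities is what ties the two paths together.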

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Image Retrieval with Multi-Modal Query | MSCOCO-1k | Image-to-text R@1 | 41.2 | Dual-path CNN |
| Image Retrieval with Multi-Modal Query | MSCOCO-1k | Text-to-image R@1 | 25.3 | Dual-path CNN |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 55.6 | Dual-Path (ResNet) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 81.9 | Dual-Path (ResNet) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 89.5 | Dual-Path (ResNet) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 39.1 | Dual-Path (ResNet) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 69.2 | Dual-Path (ResNet) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 80.9 | Dual-Path (ResNet) |
| Image Retrieval with Multi-Modal Query | CUHK-PEDES | Text-to-image Medr | 2 | Dual Path |
| Text based Person Retrieval | CUHK-PEDES | R@1 | 44.4 | Dual Path |
| Text based Person Retrieval | CUHK-PEDES | R@5 | 66.26 | Dual Path |
| Text based Person Retrieval | CUHK-PEDES | R@10 | 75.07 | Dual Path |
| Cross-Modal Information Retrieval | MSCOCO-1k | Image-to-text R@1 | 41.2 | Dual-path CNN |
| Cross-Modal Information Retrieval | MSCOCO-1k | Text-to-image R@1 | 25.3 | Dual-path CNN |
| Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@1 | 55.6 | Dual-Path (ResNet) |
| Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@5 | 81.9 | Dual-Path (ResNet) |
| Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@10 | 89.5 | Dual-Path (ResNet) |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@1 | 39.1 | Dual-Path (ResNet) |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@5 | 69.2 | Dual-Path (ResNet) |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@10 | 80.9 | Dual-Path (ResNet) |
| Cross-Modal Information Retrieval | CUHK-PEDES | Text-to-image Medr | 2 | Dual Path |
| Cross-Modal Retrieval | MSCOCO-1k | Image-to-text R@1 | 41.2 | Dual-path CNN |
| Cross-Modal Retrieval | MSCOCO-1k | Text-to-image R@1 | 25.3 | Dual-path CNN |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 55.6 | Dual-Path (ResNet) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 81.9 | Dual-Path (ResNet) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 89.5 | Dual-Path (ResNet) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 39.1 | Dual-Path (ResNet) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 69.2 | Dual-Path (ResNet) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 80.9 | Dual-Path (ResNet) |
| Cross-Modal Retrieval | CUHK-PEDES | Text-to-image Medr | 2 | Dual Path |
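The R@K numbers above are standard Recall@K for cross-modal retrieval: the fraction of queries whose ground-truth match appears among the top K retrieved items by embedding similarity. A minimal sketch of how such a metric is computed (the function name and the one-match-per-query simplification are assumptions; benchmarks like Flickr30k score against five captions per image):

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k):
    """Recall@K sketch: ground-truth pairs are assumed to sit on the
    diagonal, i.e. query i matches gallery item i (one match each)."""
    # L2-normalize so the dot product is cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T                      # (num_queries, num_gallery)
    ranks = (-sim).argsort(axis=1)     # gallery indices, best first
    hits = [i in ranks[i, :k] for i in range(len(ranks))]
    return 100.0 * float(np.mean(hits))
```

Swapping which modality is the query and which is the gallery gives the "Image-to-text" versus "Text-to-image" rows; the median rank (Medr) of the ground-truth match can be read off the same `ranks` array.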

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
- Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)