Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

Yale Song, Mohammad Soleymani

2019-06-11 · CVPR 2019 · Cross-Modal Retrieval · Video-Text Retrieval · Text Retrieval · Multiple Instance Learning · Retrieval
Paper · PDF · Code

Abstract

Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared space. Unfortunately, injective embedding cannot effectively handle polysemous instances with multiple possible meanings; at best, it would find an average representation of different meanings. This hinders its use in real-world scenarios where individual instances and their cross-modal associations are often ambiguous. In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning. To learn visual-semantic embedding, we tie up two PIE-Nets and optimize them jointly in the multiple instance learning framework. Most existing work on cross-modal retrieval focuses on image-text data. Here, we also tackle a more challenging case of video-text retrieval. To facilitate further research in video-text retrieval, we release a new dataset of 50K video-sentence pairs collected from social media, dubbed MRW (my reaction when). We demonstrate our approach on both image-text and video-text retrieval scenarios using MS-COCO, TGIF, and our new MRW dataset.
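
Below is a minimal sketch of a PIE-Net-style module, provided only to make the idea above concrete. It assumes local features arrive as a (batch, N, D) tensor and the global feature as a (batch, D) tensor, and it replaces the paper's multi-head self-attention with a simpler per-head attention pooling; the names (PIENetSketch, num_embeds) are illustrative and not taken from the authors' code.

```python
# Hedged sketch of a PIE-Net-style embedding head (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PIENetSketch(nn.Module):
    """Produces K diverse embeddings: attention over local features plus residual fusion with the global feature."""

    def __init__(self, dim: int, num_embeds: int = 4):
        super().__init__()
        self.num_embeds = num_embeds
        self.attn = nn.Linear(dim, num_embeds)  # one attention score per embedding head
        self.fc = nn.Linear(dim, dim)           # transforms the attended local context

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # global_feat: (B, D); local_feats: (B, N, D)
        weights = F.softmax(self.attn(local_feats), dim=1)              # (B, N, K), attention over the N locals
        attended = torch.einsum("bnk,bnd->bkd", weights, local_feats)   # (B, K, D) locally-guided features
        # Residual learning: add each locally-guided feature to the global context, then L2-normalize.
        out = global_feat.unsqueeze(1) + self.fc(attended)              # (B, K, D)
        return F.normalize(out, dim=-1)


# Example: 4 candidate embeddings for a batch of 2 images with 36 region features each.
net = PIENetSketch(dim=512, num_embeds=4)
embeds = net(torch.randn(2, 512), torch.randn(2, 36, 512))  # -> shape (2, 4, 512)
```

In the multiple instance learning setup described in the abstract, an image-text pair would then be scored by the best match among the K x K combinations of visual and textual embeddings; that loss is omitted from this sketch.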

Results

Task                                   | Dataset   | Metric             | Value | Model
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1  | 45.2  | PVSE
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 84.5  | PVSE
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5  | 74.3  | PVSE
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1  | 32.4  | PVSE
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 75    | PVSE
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5  | 63    | PVSE
Cross-Modal Information Retrieval      | COCO 2014 | Image-to-text R@1  | 45.2  | PVSE
Cross-Modal Information Retrieval      | COCO 2014 | Image-to-text R@10 | 84.5  | PVSE
Cross-Modal Information Retrieval      | COCO 2014 | Image-to-text R@5  | 74.3  | PVSE
Cross-Modal Information Retrieval      | COCO 2014 | Text-to-image R@1  | 32.4  | PVSE
Cross-Modal Information Retrieval      | COCO 2014 | Text-to-image R@10 | 75    | PVSE
Cross-Modal Information Retrieval      | COCO 2014 | Text-to-image R@5  | 63    | PVSE
Cross-Modal Retrieval                  | COCO 2014 | Image-to-text R@1  | 45.2  | PVSE
Cross-Modal Retrieval                  | COCO 2014 | Image-to-text R@10 | 84.5  | PVSE
Cross-Modal Retrieval                  | COCO 2014 | Image-to-text R@5  | 74.3  | PVSE
Cross-Modal Retrieval                  | COCO 2014 | Text-to-image R@1  | 32.4  | PVSE
Cross-Modal Retrieval                  | COCO 2014 | Text-to-image R@10 | 75    | PVSE
Cross-Modal Retrieval                  | COCO 2014 | Text-to-image R@5  | 63    | PVSE
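
For reference, here is a generic sketch of how Recall@K figures like those above are typically computed: given an N x N similarity matrix in which query i's ground-truth item sits at column i, R@K is the percentage of queries whose ground truth appears among the top K retrieved items. This is an illustration under that single-ground-truth assumption, not the authors' evaluation script (the standard COCO protocol pairs each image with five captions and counts a hit if any of them is retrieved).

```python
# Generic Recall@K for retrieval, assuming query i's ground truth is gallery item i.
import numpy as np


def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Percentage of queries whose ground-truth item appears in the top-k results."""
    ranks = np.argsort(-sim, axis=1)          # gallery indices sorted by descending similarity
    gt = np.arange(sim.shape[0])[:, None]     # ground-truth column index for each query
    hits = (ranks[:, :k] == gt).any(axis=1)   # True if the ground truth is within the top-k
    return 100.0 * hits.mean()


sim = np.random.randn(1000, 1000)             # placeholder similarity scores
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```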

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL · 2025-07-17
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals · 2025-07-17
A Survey of Context Engineering for Large Language Models · 2025-07-17
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval · 2025-07-17
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker · 2025-07-16
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos · 2025-07-16
Context-Aware Search and Retrieval Over Erasure Channels · 2025-07-16
Seq vs Seq: An Open Suite of Paired Encoders and Decoders · 2025-07-15