
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler

Published: 2017-07-18
Tasks: Cross-Modal Retrieval, Structured Prediction, Visual Reasoning, Retrieval, Image Retrieval
Links: Paper · PDF · Code (one official implementation; several community re-implementations)

Abstract

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
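The "simple change" the abstract refers to is replacing the sum over negatives in the standard triplet ranking loss with a max, so each positive image-caption pair is penalized only by its hardest in-batch negative. A minimal numpy sketch of that max-hinge loss, assuming a batch similarity matrix with the matching pairs on the diagonal (the function name and the default margin are illustrative, not taken from the paper's code):

```python
import numpy as np

def max_hinge_loss(sim, margin=0.2):
    """Max-of-hinges ranking loss over a batch.

    sim: (N, N) matrix of image-caption similarities, where sim[i, i]
    is the score of the i-th matching (positive) pair.
    """
    pos = np.diag(sim).reshape(-1, 1)          # positive score per image, shape (N, 1)

    # Hinge costs for every candidate negative.
    cost_cap = np.clip(margin + sim - pos, 0, None)    # wrong caption for a given image
    cost_img = np.clip(margin + sim - pos.T, 0, None)  # wrong image for a given caption

    # Positives are not negatives of themselves.
    mask = np.eye(sim.shape[0], dtype=bool)
    cost_cap[mask] = 0.0
    cost_img[mask] = 0.0

    # The key change: keep only the hardest negative per positive pair
    # (max along the negatives axis) instead of summing over all of them.
    return cost_cap.max(axis=1).sum() + cost_img.max(axis=0).sum()
```

With an identity similarity matrix every positive beats every negative by more than the margin, so the loss is zero; a negative that scores within the margin of its positive contributes its single largest hinge violation.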

Results

Task                                   | Dataset   | Metric            | Value | Model
---------------------------------------|-----------|-------------------|-------|---------------
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 52.9  | VSE++ (ResNet)
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 80.5  | VSE++ (ResNet)
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10| 87.2  | VSE++ (ResNet)
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 39.6  | VSE++ (ResNet)
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 70.1  | VSE++ (ResNet)
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10| 79.5  | VSE++ (ResNet)
Cross-Modal Information Retrieval      | Flickr30k | Image-to-text R@1 | 52.9  | VSE++ (ResNet)
Cross-Modal Information Retrieval      | Flickr30k | Image-to-text R@5 | 80.5  | VSE++ (ResNet)
Cross-Modal Information Retrieval      | Flickr30k | Image-to-text R@10| 87.2  | VSE++ (ResNet)
Cross-Modal Information Retrieval      | Flickr30k | Text-to-image R@1 | 39.6  | VSE++ (ResNet)
Cross-Modal Information Retrieval      | Flickr30k | Text-to-image R@5 | 70.1  | VSE++ (ResNet)
Cross-Modal Information Retrieval      | Flickr30k | Text-to-image R@10| 79.5  | VSE++ (ResNet)
Cross-Modal Retrieval                  | Flickr30k | Image-to-text R@1 | 52.9  | VSE++ (ResNet)
Cross-Modal Retrieval                  | Flickr30k | Image-to-text R@5 | 80.5  | VSE++ (ResNet)
Cross-Modal Retrieval                  | Flickr30k | Image-to-text R@10| 87.2  | VSE++ (ResNet)
Cross-Modal Retrieval                  | Flickr30k | Text-to-image R@1 | 39.6  | VSE++ (ResNet)
Cross-Modal Retrieval                  | Flickr30k | Text-to-image R@5 | 70.1  | VSE++ (ResNet)
Cross-Modal Retrieval                  | Flickr30k | Text-to-image R@10| 79.5  | VSE++ (ResNet)

Related Papers

LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)