Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That change, combined with fine-tuning and the use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on the MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
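The abstract does not spell out the change to the loss, so the following is only a minimal sketch under a stated assumption: a common triplet ranking loss sums hinge costs over all negatives in a mini-batch, and the hard-negative variant keeps only the largest cost per positive pair. The function name `hinge_ranking_losses`, the similarity-matrix layout, and the margin of 0.2 are illustrative choices, not taken from the text above.

```python
def hinge_ranking_losses(sim, margin=0.2):
    """Compare two ranking losses on a square similarity matrix.

    sim[i][j] is the similarity between image i and caption j; the
    matching (positive) pairs are assumed to lie on the diagonal.
    Returns (sum_of_hinges, max_of_hinges). The margin is an assumed value.
    """
    n = len(sim)
    sum_loss = 0.0
    max_loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hinge costs with non-matching captions as negatives for image i.
        cap_costs = [max(0.0, margin + sim[i][j] - pos)
                     for j in range(n) if j != i]
        # Hinge costs with non-matching images as negatives for caption i.
        img_costs = [max(0.0, margin + sim[j][i] - pos)
                     for j in range(n) if j != i]
        # Common formulation: every violating negative contributes.
        sum_loss += sum(cap_costs) + sum(img_costs)
        # Hard-negative formulation: only the worst violator contributes.
        max_loss += max(cap_costs, default=0.0) + max(img_costs, default=0.0)
    return sum_loss, max_loss


sim = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.4, 0.6, 0.5],
]
sum_loss, max_loss = hinge_ranking_losses(sim)
print(sum_loss, max_loss)
```

In this sketch the hard-negative loss can never exceed the summed loss, because it keeps a single term from each list of non-negative costs.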
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 52.9 | VSE++ (ResNet) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 80.5 | VSE++ (ResNet) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 87.2 | VSE++ (ResNet) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 39.6 | VSE++ (ResNet) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 70.1 | VSE++ (ResNet) |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 79.5 | VSE++ (ResNet) |
| Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@1 | 52.9 | VSE++ (ResNet) |
| Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@5 | 80.5 | VSE++ (ResNet) |
| Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@10 | 87.2 | VSE++ (ResNet) |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@1 | 39.6 | VSE++ (ResNet) |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@5 | 70.1 | VSE++ (ResNet) |
| Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@10 | 79.5 | VSE++ (ResNet) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 52.9 | VSE++ (ResNet) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 80.5 | VSE++ (ResNet) |
| Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 87.2 | VSE++ (ResNet) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 39.6 | VSE++ (ResNet) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 70.1 | VSE++ (ResNet) |
| Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 79.5 | VSE++ (ResNet) |
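
The R@K figures in the table are recall-at-K scores: the percentage of queries for which a correct item appears among the top K retrieved results. As a rough illustration only, the sketch below computes R@K from a similarity matrix under the simplifying assumption that each query has exactly one correct item, stored at the same index. Standard Flickr30K evaluation pairs each image with several captions, which this sketch does not model. The name `recall_at_k` is hypothetical.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth item ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumes (for illustration) one ground-truth candidate per query,
    located at index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank: how many candidates score strictly higher
        # than the ground-truth candidate.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


sim = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.4, 0.6, 0.5],
]
print(recall_at_k(sim, 1), recall_at_k(sim, 2))
```

Transposing the similarity matrix swaps the retrieval direction, so the same function can be reused for the opposite query modality in this simplified setting.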