Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Bhavin Jawade, Joao V. B. Soares, Kapil Thadani, Deen Dayal Mohan, Amir Erfan Eshratifar, Benjamin Culpepper, Paloma de Juan, Srirangaraj Setlur, Venu Govindaraju

2025-01-12 · WACV 2025
Tasks: Retrieval · Zero-Shot Composed Image Retrieval (ZS-CIR) · Image Retrieval
Paper · PDF

Abstract

Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
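The core training idea described above — contrastively training a composition network whose target is the *text* embedding of the target description, rather than a target image embedding — can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the linear composition map `compose`, the weight matrix `W`, and the embedding sizes are all hypothetical stand-ins; the InfoNCE-style loss is a standard choice for contrastive pretraining of this kind.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere, as in CLIP-style models."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def compose(img_emb, mod_emb, W):
    """Hypothetical composition network: a linear map over the
    concatenated reference-image and modification-text embeddings."""
    return l2_normalize(np.concatenate([img_emb, mod_emb], axis=-1) @ W)

def info_nce(composed, proxy_targets, temperature=0.07):
    """Contrastive loss: each composed query embedding should match its
    own proxy target (the VLM text embedding standing in for the target
    image), with the other targets in the batch acting as negatives."""
    logits = (composed @ proxy_targets.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # match i-th query to i-th target

# Toy batch: 4 triplets with 8-dim embeddings (sizes are illustrative).
rng = np.random.default_rng(0)
img = l2_normalize(rng.normal(size=(4, 8)))   # reference image embeddings
mod = l2_normalize(rng.normal(size=(4, 8)))   # modification text embeddings
tgt = l2_normalize(rng.normal(size=(4, 8)))   # proxy target text embeddings
W = rng.normal(size=(16, 8))                  # composition weights (learned in practice)

loss = info_nce(compose(img, mod, W), tgt)
```

In a real setup, `W` (or a richer composition network) would be optimized by gradient descent on this loss over large image-text pair datasets, with the frozen vision-language model supplying all three embeddings.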

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 49.24 | SCOT (WACV 2025) |
| Image Retrieval | Fashion IQ | R@10 | 38.45 | SCOT (WACV 2025) |
| Image Retrieval | Fashion IQ | R@50 | 60.03 | SCOT (WACV 2025) |
| Image Retrieval | CIRCO | mAP@10 | 37.88 | SCOT (WACV 2025) |
| Image Retrieval | CIRR | R@1 | 36.82 | SCOT (WACV 2025) |
| Image Retrieval | CIRR | R@5 | 64.34 | SCOT (WACV 2025) |
| Image Retrieval | CIRR | R@10 | 74.48 | SCOT (WACV 2025) |
| Image Retrieval | CIRR | R@50 | 93.42 | SCOT (WACV 2025) |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 49.24 | SCOT (WACV 2025) |
| Composed Image Retrieval (CoIR) | Fashion IQ | R@10 | 38.45 | SCOT (WACV 2025) |
| Composed Image Retrieval (CoIR) | Fashion IQ | R@50 | 60.03 | SCOT (WACV 2025) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 37.88 | SCOT (WACV 2025) |
| Composed Image Retrieval (CoIR) | CIRR | R@1 | 36.82 | SCOT (WACV 2025) |
| Composed Image Retrieval (CoIR) | CIRR | R@5 | 64.34 | SCOT (WACV 2025) |
| Composed Image Retrieval (CoIR) | CIRR | R@10 | 74.48 | SCOT (WACV 2025) |
| Composed Image Retrieval (CoIR) | CIRR | R@50 | 93.42 | SCOT (WACV 2025) |
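The Fashion IQ headline number is, by the benchmark's convention, the mean of Recall@10 and Recall@50, and the table's own rows confirm it: (38.45 + 60.03) / 2 = 49.24. A brief sketch of how these recall figures are computed (the `recall_at_k` helper and its inputs are illustrative, not from the paper):

```python
def recall_at_k(target_ranks, k):
    """Percentage of queries whose target image appears in the top-k
    retrieved results. `target_ranks` holds each target's 1-based rank."""
    return 100.0 * sum(rank <= k for rank in target_ranks) / len(target_ranks)

# Toy example: 4 queries whose targets were ranked 1st, 12th, 3rd, 60th.
ranks = [1, 12, 3, 60]
r10 = recall_at_k(ranks, 10)   # 2 of 4 targets in the top 10 -> 50.0

# Fashion IQ's summary metric averages two cutoffs; applied to the
# reported SCOT values this reproduces the table's 49.24.
fashioniq_avg = (38.45 + 60.03) / 2
```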

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)