SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Bhavin Jawade, Joao V. B. Soares, Kapil Thadani, Deen Dayal Mohan, Amir Erfan Eshratifar, Benjamin Culpepper, Paloma de Juan, Srirangaraj Setlur, Venu Govindaraju

2025-01-12 | WACV 2025 | Tasks: Retrieval, Zero-Shot Composed Image Retrieval (ZS-CIR), Image Retrieval

Abstract

Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
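
The proxy-target idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `compose` function is a hypothetical stand-in (a plain vector sum) for the learned embedding composition network, the loss is a standard InfoNCE objective with a temperature of 0.07, and the three-dimensional toy vectors stand in for embeddings from a contrastively-pretrained vision-language model. The only point it shows is that the composed (image + modification) embedding is contrasted against the text embedding of the modified caption rather than against a target image embedding.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compose(image_emb, modification_emb):
    # Hypothetical stand-in for a learned composition network: a plain sum.
    return l2_normalize([i + m for i, m in zip(image_emb, modification_emb)])


def info_nce(queries, targets, temperature=0.07):
    """Mean cross-entropy of each query against all targets in the batch.

    The positive for query i is targets[i]; all other targets are negatives.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two samples (assumed values, not real model embeddings).
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
modification_embs = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
# Proxy targets: text embeddings of the modified captions, used in place of
# target image embeddings.
proxy_target_embs = [
    l2_normalize([1.0, 0.0, 1.0]),
    l2_normalize([0.0, 1.0, -1.0]),
]

queries = [compose(i, m) for i, m in zip(image_embs, modification_embs)]
aligned_loss = info_nce(queries, proxy_target_embs)
shuffled_loss = info_nce(queries, proxy_target_embs[::-1])
```

In this toy batch the loss is close to zero when each composed query is paired with its own proxy target and much larger when the targets are swapped, which is the signal a contrastive objective of this kind would use to train the composition network.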

Results

Task	Dataset	Metric	Value	Model
Image Retrieval	Fashion IQ	(Recall@10+Recall@50)/2	49.24	SCOT (WACV 2025)
Image Retrieval	Fashion IQ	R@10	38.45	SCOT (WACV 2025)
Image Retrieval	Fashion IQ	R@50	60.03	SCOT (WACV 2025)
Image Retrieval	CIRCO	mAP@10	37.88	SCOT (WACV 2025)
Image Retrieval	CIRR	R@1	36.82	SCOT (WACV 2025)
Image Retrieval	CIRR	R@10	74.48	SCOT (WACV 2025)
Image Retrieval	CIRR	R@5	64.34	SCOT (WACV 2025)
Image Retrieval	CIRR	R@50	93.42	SCOT (WACV 2025)
Composed Image Retrieval (CoIR)	Fashion IQ	(Recall@10+Recall@50)/2	49.24	SCOT (WACV 2025)
Composed Image Retrieval (CoIR)	Fashion IQ	R@10	38.45	SCOT (WACV 2025)
Composed Image Retrieval (CoIR)	Fashion IQ	R@50	60.03	SCOT (WACV 2025)
Composed Image Retrieval (CoIR)	CIRCO	mAP@10	37.88	SCOT (WACV 2025)
Composed Image Retrieval (CoIR)	CIRR	R@1	36.82	SCOT (WACV 2025)
Composed Image Retrieval (CoIR)	CIRR	R@10	74.48	SCOT (WACV 2025)
Composed Image Retrieval (CoIR)	CIRR	R@5	64.34	SCOT (WACV 2025)
Composed Image Retrieval (CoIR)	CIRR	R@50	93.42	SCOT (WACV 2025)
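
For readers unfamiliar with the metrics in the table, the following is a small sketch of how Recall@K and the averaged (Recall@10+Recall@50)/2 figure are typically computed. The function names and the toy rankings are hypothetical and chosen only for illustration; the last line simply checks that the reported Fashion IQ R@10 and R@50 values are consistent with the reported average.

```python
def recall_at_k(ranked_ids_per_query, target_ids, k):
    """Percentage of queries whose target appears in the top-k ranked results."""
    hits = sum(
        1
        for ranked, target in zip(ranked_ids_per_query, target_ids)
        if target in ranked[:k]
    )
    return 100.0 * hits / len(target_ids)


def average_recall(r_at_10, r_at_50):
    """Composite metric from the table: mean of Recall@10 and Recall@50."""
    return (r_at_10 + r_at_50) / 2


# Toy example (hypothetical image ids): three queries, one ranking each.
rankings = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
]
targets = ["a", "f", "z"]

toy_r_at_1 = recall_at_k(rankings, targets, k=1)  # only the first query hits
toy_r_at_3 = recall_at_k(rankings, targets, k=3)  # first and second queries hit

# Consistency check on the reported Fashion IQ numbers.
reported_average = round(average_recall(38.45, 60.03), 2)
```

With the table's values, (38.45 + 60.03) / 2 = 49.24, which matches the reported composite score.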
