Zero-Shot Composed Image Retrieval with Textual Inversion

Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto del Bimbo

2023-03-27 | ICCV 2023 | Composed Image Retrieval (CoIR), Retrieval, Zero-Shot Composed Image Retrieval (ZS-CIR), Image Retrieval


Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE.
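The substitution step described in the abstract, where the reference image becomes a pseudo-word token that sits alongside the relative caption, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the linear map `project_to_token_space`, the `$` placeholder, the prompt wording, and the two-dimensional toy vocabulary are hypothetical and are not taken from the SEARLE code. The real method operates on CLIP features and feeds the resulting sequence to the CLIP text encoder, which is not modelled here.

```python
from typing import Dict, List

Vector = List[float]


def project_to_token_space(image_features: Vector, weights: List[Vector]) -> Vector:
    """Hypothetical linear map from image-feature space to token-embedding space."""
    return [sum(w * x for w, x in zip(row, image_features)) for row in weights]


def compose_query(
    caption_tokens: List[str],
    pseudo_word: Vector,
    vocab: Dict[str, Vector],
    placeholder: str = "$",
) -> List[Vector]:
    """Look up each token embedding, substituting the pseudo-word for the placeholder."""
    return [pseudo_word if tok == placeholder else vocab[tok] for tok in caption_tokens]


# Toy 2-dimensional token embeddings (illustrative values only).
toy_vocab = {
    "a": [0.1, 0.0],
    "photo": [0.2, 0.1],
    "of": [0.0, 0.3],
    "that": [0.4, 0.4],
    "is": [0.5, 0.0],
    "red": [0.9, 0.2],
}

# Toy 3-dimensional image features and a 2x3 projection matrix (both made up).
reference_features = [1.0, 2.0, 3.0]
projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

pseudo_word = project_to_token_space(reference_features, projection)
prompt = ["a", "photo", "of", "$", "that", "is", "red"]
query_embeddings = compose_query(prompt, pseudo_word, toy_vocab)
```

In this sketch `query_embeddings` has one vector per prompt token, with the fourth position holding the projected image features instead of a vocabulary entry. A text encoder would then turn that sequence into a single query vector for retrieval.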

Results

Task	Dataset	Metric	Value	Model
Image Retrieval	GeneCIS	A-R@1	14.4	SEARLE (CLIP B/32)
Image Retrieval	GeneCIS	A-R@1	14.4	SEARLE (CLIP L/14)
Image Retrieval	Fashion IQ	(Recall@10+Recall@50)/2	37.76	SEARLE-XL-OTI (CLIP L/14)
Image Retrieval	Fashion IQ	(Recall@10+Recall@50)/2	35.9	SEARLE-XL (CLIP L/14)
Image Retrieval	Fashion IQ	(Recall@10+Recall@50)/2	32.71	SEARLE (CLIP B/32)
Image Retrieval	Fashion IQ	(Recall@10+Recall@50)/2	32.39	SEARLE-OTI (CLIP B/32)
Image Retrieval	ImageNet	Average Recall	21.54	SEARLE-XL (CLIP L/14)
Image Retrieval	ImageNet	Average Recall	20.42	SEARLE-XL-OTI (CLIP B/32)
Image Retrieval	ImageNet	Average Recall	12.77	SEARLE-OTI (CLIP B/32)
Image Retrieval	ImageNet	Average Recall	11.94	SEARLE (CLIP B/32)
Image Retrieval	CIRCO	mAP@10	12.73	SEARLE-XL (CLIP L/14)
Image Retrieval	CIRCO	mAP@10	9.94	SEARLE (CLIP B/32)
Image Retrieval	COCO (Common Objects in Context)	Actions Recall@5	31.43	SEARLE-XL-OTI (CLIP L/14)
Image Retrieval	COCO (Common Objects in Context)	Actions Recall@5	29.02	SEARLE-XL (CLIP L/14)
Image Retrieval	COCO (Common Objects in Context)	Actions Recall@5	26	SEARLE-OTI (CLIP B/32)
Image Retrieval	COCO (Common Objects in Context)	Actions Recall@5	24.58	SEARLE (CLIP B/32)
Image Retrieval	FashionIQ	R@10	27.61	SEARLE-XL-OTI
Image Retrieval	ImageNet-R	(Recall@10+Recall@50)/2	21.54	SEARLE-XL (CLIP L/14)
Image Retrieval	ImageNet-R	(Recall@10+Recall@50)/2	20.42	SEARLE-XL-OTI (CLIP B/32)
Image Retrieval	ImageNet-R	(Recall@10+Recall@50)/2	12.77	SEARLE-OTI (CLIP B/32)
Image Retrieval	ImageNet-R	(Recall@10+Recall@50)/2	11.94	SEARLE (CLIP B/32)
Image Retrieval	CIRR	R@5	53.42	SEARLE
Image Retrieval	CIRR	R@5	52.48	SEARLE-XL
Composed Image Retrieval (CoIR)	GeneCIS	A-R@1	14.4	SEARLE (CLIP B/32)
Composed Image Retrieval (CoIR)	GeneCIS	A-R@1	14.4	SEARLE (CLIP L/14)
Composed Image Retrieval (CoIR)	Fashion IQ	(Recall@10+Recall@50)/2	37.76	SEARLE-XL-OTI (CLIP L/14)
Composed Image Retrieval (CoIR)	Fashion IQ	(Recall@10+Recall@50)/2	35.9	SEARLE-XL (CLIP L/14)
Composed Image Retrieval (CoIR)	Fashion IQ	(Recall@10+Recall@50)/2	32.71	SEARLE (CLIP B/32)
Composed Image Retrieval (CoIR)	Fashion IQ	(Recall@10+Recall@50)/2	32.39	SEARLE-OTI (CLIP B/32)
Composed Image Retrieval (CoIR)	ImageNet	Average Recall	21.54	SEARLE-XL (CLIP L/14)
Composed Image Retrieval (CoIR)	ImageNet	Average Recall	20.42	SEARLE-XL-OTI (CLIP B/32)
Composed Image Retrieval (CoIR)	ImageNet	Average Recall	12.77	SEARLE-OTI (CLIP B/32)
Composed Image Retrieval (CoIR)	ImageNet	Average Recall	11.94	SEARLE (CLIP B/32)
Composed Image Retrieval (CoIR)	CIRCO	mAP@10	12.73	SEARLE-XL (CLIP L/14)
Composed Image Retrieval (CoIR)	CIRCO	mAP@10	9.94	SEARLE (CLIP B/32)
Composed Image Retrieval (CoIR)	COCO (Common Objects in Context)	Actions Recall@5	31.43	SEARLE-XL-OTI (CLIP L/14)
Composed Image Retrieval (CoIR)	COCO (Common Objects in Context)	Actions Recall@5	29.02	SEARLE-XL (CLIP L/14)
Composed Image Retrieval (CoIR)	COCO (Common Objects in Context)	Actions Recall@5	26	SEARLE-OTI (CLIP B/32)
Composed Image Retrieval (CoIR)	COCO (Common Objects in Context)	Actions Recall@5	24.58	SEARLE (CLIP B/32)
Composed Image Retrieval (CoIR)	FashionIQ	R@10	27.61	SEARLE-XL-OTI
Composed Image Retrieval (CoIR)	ImageNet-R	(Recall@10+Recall@50)/2	21.54	SEARLE-XL (CLIP L/14)
Composed Image Retrieval (CoIR)	ImageNet-R	(Recall@10+Recall@50)/2	20.42	SEARLE-XL-OTI (CLIP B/32)
Composed Image Retrieval (CoIR)	ImageNet-R	(Recall@10+Recall@50)/2	12.77	SEARLE-OTI (CLIP B/32)
Composed Image Retrieval (CoIR)	ImageNet-R	(Recall@10+Recall@50)/2	11.94	SEARLE (CLIP B/32)
Composed Image Retrieval (CoIR)	CIRR	R@5	53.42	SEARLE
Composed Image Retrieval (CoIR)	CIRR	R@5	52.48	SEARLE-XL
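The metrics named in the table (Recall@K, the average of Recall@10 and Recall@50, and mAP@K for queries with multiple ground truths) can be sketched as follows. This is a minimal illustration, not the official evaluation code: the function names are hypothetical, and normalising average precision by `min(k, number of targets)` is an assumption about how mAP@K is defined for multi-ground-truth queries. The ranked lists are assumed to contain no duplicate items.

```python
from typing import List, Sequence, Set


def recall_at_k(ranked: Sequence[str], targets: Set[str], k: int) -> float:
    """Return 1.0 if any target appears among the top-k ranked items, else 0.0."""
    return 1.0 if any(item in targets for item in ranked[:k]) else 0.0


def average_precision_at_k(ranked: Sequence[str], targets: Set[str], k: int) -> float:
    """AP@k for one query that may have several ground truths.

    Assumption: the sum of precisions at relevant ranks is divided by
    min(k, number of targets).
    """
    if not targets:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in targets:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(targets))


def mean_percent(scores: List[float]) -> float:
    """Average per-query scores and express the result as a percentage."""
    return 100.0 * sum(scores) / len(scores)


# Two toy queries: (ranked gallery ids, set of ground-truth ids).
queries = [
    (["a", "b", "c", "d"], {"b"}),
    (["x", "y", "z", "w"], {"w", "x"}),
]

recall_1 = mean_percent([recall_at_k(r, t, 1) for r, t in queries])
recall_3 = mean_percent([recall_at_k(r, t, 3) for r, t in queries])
average_recall = (recall_1 + recall_3) / 2  # same form as (Recall@10+Recall@50)/2
map_3 = mean_percent([average_precision_at_k(r, t, 3) for r, t in queries])
```

With a single ground truth per query, Recall@K only asks whether the target was retrieved. mAP@K additionally rewards ranking several correct images near the top, which is why it suits a dataset with multiple ground truths per query.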
