Zero-Shot Composed Image Retrieval with Textual Inversion

Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto del Bimbo

Date: 2023-03-27
Venue: ICCV 2023
Tasks: Composed Image Retrieval (CoIR), Retrieval, Zero-Shot Composed Image Retrieval (ZS-CIR), Image Retrieval

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE.
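To make the composition step described in the abstract concrete, here is a minimal toy sketch in NumPy. It is not the SEARLE implementation. Everything in it is an assumption for illustration: the tiny vocabulary, the `$` placeholder and the "a photo of $ that ..." template, the random linear map `W` standing in for the image-to-pseudo-word mapping, and the mean-pooling `toy_text_encoder` standing in for the CLIP text encoder. It only shows the data flow: image feature, then pseudo-word token embedding, then insertion into the caption's token sequence, then an encoded query used to rank a gallery by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, IMG_DIM = 8, 6  # toy sizes (hypothetical, far smaller than CLIP)

# Hypothetical toy vocabulary of token embeddings.
WORDS = ["a", "photo", "of", "that", "is", "red"]
vocab = {w: rng.normal(size=EMB_DIM) for w in WORDS}

# Hypothetical linear map standing in for the image-to-pseudo-word mapping.
W = rng.normal(size=(EMB_DIM, IMG_DIM))

def image_to_pseudo_token(image_feature):
    """Map an image feature to a single vector in token-embedding space."""
    return W @ image_feature

def compose_query(image_feature, relative_caption):
    """Token-embedding sequence for 'a photo of $ that <relative caption>'."""
    tokens = ["a", "photo", "of", "$", "that"] + relative_caption.split()
    pseudo = image_to_pseudo_token(image_feature)
    # The placeholder "$" is replaced by the pseudo-word token embedding.
    return np.stack([pseudo if t == "$" else vocab[t] for t in tokens])

def toy_text_encoder(token_embeddings):
    """Stand-in for a text encoder: mean-pool, then L2-normalize."""
    pooled = token_embeddings.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def rank_gallery(query_vec, gallery):
    """Return gallery indices sorted by decreasing cosine similarity."""
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(gallery @ query_vec))

reference_feature = rng.normal(size=IMG_DIM)   # toy reference-image feature
sequence = compose_query(reference_feature, "is red")
query = toy_text_encoder(sequence)
gallery = rng.normal(size=(5, EMB_DIM))        # toy gallery image embeddings
ranking = rank_gallery(query, gallery)
print(ranking)
```

In a real system the gallery embeddings and the query would come from the image and text encoders of the same CLIP model, so that they live in one joint space; the random vectors here only keep the sketch self-contained.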

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Image Retrieval | GeneCIS | A-R@1 | 14.4 | SEARLE (CLIP B/32)
Image Retrieval | GeneCIS | A-R@1 | 14.4 | SEARLE (CLIP L/14)
Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 37.76 | SEARLE-XL-OTI (CLIP L/14)
Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 35.9 | SEARLE-XL (CLIP L/14)
Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 32.71 | SEARLE (CLIP B/32)
Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 32.39 | SEARLE-OTI (CLIP B/32)
Image Retrieval | ImageNet | Average Recall | 21.54 | SEARLE-XL (CLIP L/14)
Image Retrieval | ImageNet | Average Recall | 20.42 | SEARLE-XL-OTI (CLIP B/32)
Image Retrieval | ImageNet | Average Recall | 12.77 | SEARLE-OTI (CLIP B/32)
Image Retrieval | ImageNet | Average Recall | 11.94 | SEARLE (CLIP B/32)
Image Retrieval | CIRCO | mAP@10 | 12.73 | SEARLE-XL (CLIP L/14)
Image Retrieval | CIRCO | mAP@10 | 9.94 | SEARLE (CLIP B/32)
Image Retrieval | COCO (Common Objects in Context) | Actions Recall@5 | 31.43 | SEARLE-XL-OTI (CLIP L/14)
Image Retrieval | COCO (Common Objects in Context) | Actions Recall@5 | 29.02 | SEARLE-XL (CLIP L/14)
Image Retrieval | COCO (Common Objects in Context) | Actions Recall@5 | 26 | SEARLE-OTI (CLIP B/32)
Image Retrieval | COCO (Common Objects in Context) | Actions Recall@5 | 24.58 | SEARLE (CLIP B/32)
Image Retrieval | FashionIQ | R@10 | 27.61 | SEARLE-XL-OTI
Image Retrieval | ImageNet-R | (Recall@10+Recall@50)/2 | 21.54 | SEARLE-XL (CLIP L/14)
Image Retrieval | ImageNet-R | (Recall@10+Recall@50)/2 | 20.42 | SEARLE-XL-OTI (CLIP B/32)
Image Retrieval | ImageNet-R | (Recall@10+Recall@50)/2 | 12.77 | SEARLE-OTI (CLIP B/32)
Image Retrieval | ImageNet-R | (Recall@10+Recall@50)/2 | 11.94 | SEARLE (CLIP B/32)
Image Retrieval | CIRR | R@5 | 53.42 | SEARLE
Image Retrieval | CIRR | R@5 | 52.48 | SEARLE-XL
Composed Image Retrieval (CoIR) | GeneCIS | A-R@1 | 14.4 | SEARLE (CLIP B/32)
Composed Image Retrieval (CoIR) | GeneCIS | A-R@1 | 14.4 | SEARLE (CLIP L/14)
Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 37.76 | SEARLE-XL-OTI (CLIP L/14)
Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 35.9 | SEARLE-XL (CLIP L/14)
Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 32.71 | SEARLE (CLIP B/32)
Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 32.39 | SEARLE-OTI (CLIP B/32)
Composed Image Retrieval (CoIR) | ImageNet | Average Recall | 21.54 | SEARLE-XL (CLIP L/14)
Composed Image Retrieval (CoIR) | ImageNet | Average Recall | 20.42 | SEARLE-XL-OTI (CLIP B/32)
Composed Image Retrieval (CoIR) | ImageNet | Average Recall | 12.77 | SEARLE-OTI (CLIP B/32)
Composed Image Retrieval (CoIR) | ImageNet | Average Recall | 11.94 | SEARLE (CLIP B/32)
Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 12.73 | SEARLE-XL (CLIP L/14)
Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 9.94 | SEARLE (CLIP B/32)
Composed Image Retrieval (CoIR) | COCO (Common Objects in Context) | Actions Recall@5 | 31.43 | SEARLE-XL-OTI (CLIP L/14)
Composed Image Retrieval (CoIR) | COCO (Common Objects in Context) | Actions Recall@5 | 29.02 | SEARLE-XL (CLIP L/14)
Composed Image Retrieval (CoIR) | COCO (Common Objects in Context) | Actions Recall@5 | 26 | SEARLE-OTI (CLIP B/32)
Composed Image Retrieval (CoIR) | COCO (Common Objects in Context) | Actions Recall@5 | 24.58 | SEARLE (CLIP B/32)
Composed Image Retrieval (CoIR) | FashionIQ | R@10 | 27.61 | SEARLE-XL-OTI
Composed Image Retrieval (CoIR) | ImageNet-R | (Recall@10+Recall@50)/2 | 21.54 | SEARLE-XL (CLIP L/14)
Composed Image Retrieval (CoIR) | ImageNet-R | (Recall@10+Recall@50)/2 | 20.42 | SEARLE-XL-OTI (CLIP B/32)
Composed Image Retrieval (CoIR) | ImageNet-R | (Recall@10+Recall@50)/2 | 12.77 | SEARLE-OTI (CLIP B/32)
Composed Image Retrieval (CoIR) | ImageNet-R | (Recall@10+Recall@50)/2 | 11.94 | SEARLE (CLIP B/32)
Composed Image Retrieval (CoIR) | CIRR | R@5 | 53.42 | SEARLE
Composed Image Retrieval (CoIR) | CIRR | R@5 | 52.48 | SEARLE-XL
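
The results mix several kinds of metric: Recall@K for queries with a single target, the average (Recall@10+Recall@50)/2, and mAP@10 on CIRCO, where each query has multiple ground truths. The following is a minimal sketch of how such per-query scores can be computed from a ranked list of retrieved IDs. The function names are hypothetical, the ranked list is assumed to contain no duplicate IDs, and normalizing AP@K by min(K, number of ground truths) is one common convention assumed here; each benchmark's official evaluation code should be consulted for its exact definition.

```python
def recall_at_k(ranked_ids, target_id, k):
    """1.0 if the single target appears in the top-k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def average_recall_10_50(ranked_ids, target_id):
    """The (Recall@10 + Recall@50) / 2 average for one query."""
    r10 = recall_at_k(ranked_ids, target_id, 10)
    r50 = recall_at_k(ranked_ids, target_id, 50)
    return (r10 + r50) / 2

def average_precision_at_k(ranked_ids, ground_truths, k):
    """AP@K for one query with possibly several ground-truth items.

    Assumed convention: sum precision at each rank where a ground truth
    is retrieved, then divide by min(k, number of ground truths).
    """
    gts = set(ground_truths)
    if not gts:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in gts:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(gts))

def mean_over_queries(per_query_scores):
    """Average per-query scores; multiply by 100 to report a percentage."""
    scores = list(per_query_scores)
    return sum(scores) / len(scores)

# Toy usage: ground truths "a" and "c" are retrieved at ranks 1 and 3.
ranked = ["a", "b", "c", "d"]
ap = average_precision_at_k(ranked, {"a", "c"}, k=3)  # (1/1 + 2/3) / 2
```

Averaging `average_precision_at_k` over all queries gives mAP@K, and averaging `recall_at_k` over all queries gives Recall@K.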