Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Vision-by-Language for Training-Free Compositional Image Retrieval

Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata

2023-10-13 · Retrieval · Zero-Shot Composed Image Retrieval (ZS-CIR) · Image Retrieval
Paper · PDF · Code (official)

Abstract

Given an image and a target modification (e.g., an image of the Eiffel Tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database. While supervised approaches rely on costly annotation of triplets (i.e., query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the caption based on the textual target modification for subsequent retrieval via, e.g., CLIP, we achieve modular language reasoning. On four ZS-CIR benchmarks, we find competitive, in part state-of-the-art performance, improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us to investigate scaling laws and bottlenecks for ZS-CIR while easily scaling, in parts, to more than double previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable and allowing failure cases to be re-aligned post hoc. Code will be released upon acceptance.
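The abstract describes a three-stage pipeline: caption the reference image with a generative VLM, have an LLM rewrite the caption according to the textual modification, then retrieve with a text-to-image model such as CLIP. The sketch below illustrates only the data flow under stated assumptions: `caption_image` and `recompose_caption` are hypothetical stubs standing in for the VLM and LLM, and the embeddings are toy vectors rather than real CLIP outputs.

```python
import numpy as np

def caption_image(image_id: str) -> str:
    # Hypothetical stub: in CIReVL, a pre-trained generative VLM
    # produces a caption describing the reference image.
    return f"a photo of {image_id}"

def recompose_caption(caption: str, modification: str) -> str:
    # Hypothetical stub: in CIReVL, an LLM rewrites the caption so it
    # reflects the requested textual modification.
    return f"{caption}, {modification}"

def retrieve(query_emb: np.ndarray, gallery_embs: np.ndarray) -> np.ndarray:
    """Rank gallery images by cosine similarity to the text query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))  # gallery indices, best match first

# Toy gallery: 3 images with dummy 4-d "CLIP image" embeddings.
gallery = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.7, 0.7, 0.0, 0.0]])

caption = caption_image("eiffel_tower")
query_text = recompose_caption(caption, "at night, without people")
query_emb = np.array([0.0, 1.0, 0.1, 0.0])  # pretend CLIP text embedding
ranking = retrieve(query_emb, gallery)
print(query_text)
print(ranking)  # → [1 2 0]
```

In the real pipeline, the retrieval step is exactly standard CLIP text-to-image search; the training-free property comes from the captioning and recomposition stages being frozen, off-the-shelf models.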

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Image Retrieval | GeneCIS | A-R@1 | 15.9 | CIReVL (CLIP B/32) |
| Image Retrieval | GeneCIS | A-R@1 | 15.9 | CIReVL (CLIP L/14) |
| Image Retrieval | GeneCIS | A-R@1 | 17.4 | CIReVL (CLIP G/14) |
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 42.28 | CIReVL (CLIP G/14) |
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 38.82 | CIReVL (CLIP B/32) |
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 38.56 | CIReVL (CLIP L/14) |
| Image Retrieval | CIRCO | mAP@10 | 27.59 | CIReVL (CLIP G/14) |
| Image Retrieval | CIRCO | mAP@10 | 19.01 | CIReVL (CLIP L/14) |
| Image Retrieval | CIRCO | mAP@10 | 15.42 | CIReVL (CLIP B/32) |
| Image Retrieval | CIRR | R@1 | 34.65 | CIReVL (CLIP G/14) |
| Image Retrieval | CIRR | R@5 | 64.29 | CIReVL (CLIP G/14) |
| Image Retrieval | CIRR | R@1 | 24.55 | CIReVL (CLIP L/14) |
| Image Retrieval | CIRR | R@5 | 52.31 | CIReVL (CLIP L/14) |
| Image Retrieval | CIRR | R@1 | 23.94 | CIReVL (CLIP B/32) |
| Image Retrieval | CIRR | R@5 | 52.51 | CIReVL (CLIP B/32) |
| Composed Image Retrieval (CoIR) | GeneCIS | A-R@1 | 15.9 | CIReVL (CLIP B/32) |
| Composed Image Retrieval (CoIR) | GeneCIS | A-R@1 | 15.9 | CIReVL (CLIP L/14) |
| Composed Image Retrieval (CoIR) | GeneCIS | A-R@1 | 17.4 | CIReVL (CLIP G/14) |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 42.28 | CIReVL (CLIP G/14) |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 38.82 | CIReVL (CLIP B/32) |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 38.56 | CIReVL (CLIP L/14) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 27.59 | CIReVL (CLIP G/14) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 19.01 | CIReVL (CLIP L/14) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 15.42 | CIReVL (CLIP B/32) |
| Composed Image Retrieval (CoIR) | CIRR | R@1 | 34.65 | CIReVL (CLIP G/14) |
| Composed Image Retrieval (CoIR) | CIRR | R@5 | 64.29 | CIReVL (CLIP G/14) |
| Composed Image Retrieval (CoIR) | CIRR | R@1 | 24.55 | CIReVL (CLIP L/14) |
| Composed Image Retrieval (CoIR) | CIRR | R@5 | 52.31 | CIReVL (CLIP L/14) |
| Composed Image Retrieval (CoIR) | CIRR | R@1 | 23.94 | CIReVL (CLIP B/32) |
| Composed Image Retrieval (CoIR) | CIRR | R@5 | 52.51 | CIReVL (CLIP B/32) |

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)