Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata
Given an image and a target modification (e.g., an image of the Eiffel tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database. While supervised approaches rely on costly annotation of triplets (i.e., query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the caption based on the textual target modification for subsequent retrieval via, e.g., CLIP, we achieve modular language reasoning. On four ZS-CIR benchmarks, we find competitive and in part state-of-the-art performance, improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us to investigate scaling laws and bottlenecks for ZS-CIR while easily scaling up, in some cases to more than double previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable and allowing failure cases to be re-aligned post hoc. Code will be released upon acceptance.
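The three stages described in the abstract (caption the reference image, let an LLM recompose the caption with the modification text, retrieve with a text-to-image embedding model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `compose_and_retrieve` and the callables `caption_fn`, `recompose_fn`, and `embed_text_fn` are hypothetical stand-ins for a captioning VLM, an LLM, and a CLIP-style text encoder, and the gallery is assumed to be a list of precomputed image embeddings.

```python
import math
from typing import Any, Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compose_and_retrieve(
    reference_image: Any,
    modification: str,
    caption_fn: Callable[[Any], str],          # assumed: wraps a captioning VLM
    recompose_fn: Callable[[str, str], str],   # assumed: wraps an LLM prompt
    embed_text_fn: Callable[[str], Vector],    # assumed: wraps a CLIP-style text encoder
    gallery_embeddings: List[Vector],          # assumed: precomputed image embeddings
    top_k: int = 5,
) -> Tuple[str, List[int]]:
    """Return the recomposed caption and the indices of the top_k gallery images."""
    # Stage 1: describe the reference image in language.
    caption = caption_fn(reference_image)
    # Stage 2: rewrite the caption so that it reflects the requested modification.
    target_caption = recompose_fn(caption, modification)
    # Stage 3: embed the rewritten caption and rank gallery images by similarity.
    query = embed_text_fn(target_caption)
    scores = [cosine(query, image_vec) for image_vec in gallery_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return target_caption, ranked[:top_k]
```

Because every intermediate result is plain text, the recomposed caption returned alongside the ranking can be inspected, or replaced by hand, before the retrieval step is rerun; this is the kind of intervention the abstract refers to.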
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Retrieval | GeneCIS | A-R@1 | 15.9 | CIReVL (CLIP B/32) |
| Image Retrieval | GeneCIS | A-R@1 | 15.9 | CIReVL (CLIP L/14) |
| Image Retrieval | GeneCIS | A-R@1 | 17.4 | CIReVL (CLIP G/14) |
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 42.28 | CIReVL (CLIP G/14) |
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 38.82 | CIReVL (CLIP B/32) |
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 38.56 | CIReVL (CLIP L/14) |
| Image Retrieval | CIRCO | mAP@10 | 27.59 | CIReVL (CLIP G/14) |
| Image Retrieval | CIRCO | mAP@10 | 19.01 | CIReVL (CLIP L/14) |
| Image Retrieval | CIRCO | mAP@10 | 15.42 | CIReVL (CLIP B/32) |
| Image Retrieval | CIRR | R@1 | 34.65 | CIReVL (CLIP G/14) |
| Image Retrieval | CIRR | R@5 | 64.29 | CIReVL (CLIP G/14) |
| Image Retrieval | CIRR | R@1 | 24.55 | CIReVL (CLIP L/14) |
| Image Retrieval | CIRR | R@5 | 52.31 | CIReVL (CLIP L/14) |
| Image Retrieval | CIRR | R@1 | 23.94 | CIReVL (CLIP B/32) |
| Image Retrieval | CIRR | R@5 | 52.51 | CIReVL (CLIP B/32) |
| Composed Image Retrieval (CoIR) | GeneCIS | A-R@1 | 15.9 | CIReVL (CLIP B/32) |
| Composed Image Retrieval (CoIR) | GeneCIS | A-R@1 | 15.9 | CIReVL (CLIP L/14) |
| Composed Image Retrieval (CoIR) | GeneCIS | A-R@1 | 17.4 | CIReVL (CLIP G/14) |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 42.28 | CIReVL (CLIP G/14) |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 38.82 | CIReVL (CLIP B/32) |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 38.56 | CIReVL (CLIP L/14) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 27.59 | CIReVL (CLIP G/14) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 19.01 | CIReVL (CLIP L/14) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 15.42 | CIReVL (CLIP B/32) |
| Composed Image Retrieval (CoIR) | CIRR | R@1 | 34.65 | CIReVL (CLIP G/14) |
| Composed Image Retrieval (CoIR) | CIRR | R@5 | 64.29 | CIReVL (CLIP G/14) |
| Composed Image Retrieval (CoIR) | CIRR | R@1 | 24.55 | CIReVL (CLIP L/14) |
| Composed Image Retrieval (CoIR) | CIRR | R@5 | 52.31 | CIReVL (CLIP L/14) |
| Composed Image Retrieval (CoIR) | CIRR | R@1 | 23.94 | CIReVL (CLIP B/32) |
| Composed Image Retrieval (CoIR) | CIRR | R@5 | 52.51 | CIReVL (CLIP B/32) |
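
For readers unfamiliar with the metrics in the table, the sketch below shows how Recall@K and the averaged Fashion IQ score (Recall@10+Recall@50)/2 are commonly computed. It assumes one ground-truth target per query and reports percentages; the function names and the input layout (one ranked list of gallery indices per query) are illustrative choices, not taken from the benchmarks' official evaluation code.

```python
from typing import List, Sequence


def recall_at_k(rankings: Sequence[Sequence[int]], targets: Sequence[int], k: int) -> float:
    """Percentage of queries whose target index appears among the top-k ranked results."""
    if not targets or len(rankings) != len(targets):
        raise ValueError("rankings and targets must be non-empty and of equal length")
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(targets)


def averaged_recall_10_50(rankings: Sequence[Sequence[int]], targets: Sequence[int]) -> float:
    """Mean of Recall@10 and Recall@50, matching the (Recall@10+Recall@50)/2 column."""
    return (recall_at_k(rankings, targets, 10) + recall_at_k(rankings, targets, 50)) / 2


# Toy usage: two queries over a 100-image gallery ranked in index order.
rankings: List[List[int]] = [list(range(100)), list(range(100))]
targets = [5, 30]  # the first target is ranked 6th, the second 31st
score = averaged_recall_10_50(rankings, targets)
```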