Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Language-only Efficient Training of Zero-shot Composed Image Retrieval

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, Sangdoo Yun

Published: 2023-12-04
Tasks: Retrieval, Zero-Shot Composed Image Retrieval (ZS-CIR), Image Retrieval
Links: Paper, PDF, Code (official)

Abstract

The composed image retrieval (CIR) task takes a composed query of an image and text, aiming to retrieve images relevant to both conditions. Conventional CIR approaches require a training dataset of triplets (query image, query text, target image), which is very expensive to collect. Several recent works have pursued the zero-shot (ZS) CIR paradigm to tackle this issue without pre-collected triplets. However, existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity in the input texts during training. We propose a novel CIR framework that uses only language for training. Our LinCIR (Language-only training for CIR) can be trained with text datasets alone via a novel self-supervision named self-masking projection (SMP). We project the text latent embedding into the token embedding space and construct a new text by replacing the keyword tokens of the original text with the projected embedding. Then, we make the new and original texts share the same latent embedding vector. With this simple strategy, LinCIR is surprisingly efficient and highly effective: LinCIR with a CLIP ViT-G backbone trains in 48 minutes and achieves the best ZS-CIR performance on four CIR benchmarks (CIRCO, GeneCIS, FashionIQ, and CIRR), even outperforming a supervised method on FashionIQ. Code is available at https://github.com/navervision/lincir
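The self-masking projection described in the abstract can be illustrated with a minimal numpy sketch. This is not the released LinCIR code: the encoder here is a toy mean-pooling stand-in for CLIP's text encoder, the projection matrix is randomly initialized, and names like `smp_loss` and `encode` are illustrative assumptions. It only shows the shape of the training signal: project the latent into token space, overwrite the keyword tokens with the projection, and pull the two latents together.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 100, 16
tok_emb = rng.normal(size=(VOCAB, DIM))   # toy token embedding table
W = rng.normal(size=(DIM, DIM)) * 0.1     # hypothetical projection into token space

def encode(token_vecs):
    """Toy text encoder: mean-pool token vectors into one latent (CLIP stand-in)."""
    return token_vecs.mean(axis=0)

def smp_loss(token_ids, keyword_mask):
    """Self-masking projection loss sketch for one text."""
    orig = tok_emb[token_ids]             # (T, D) original token vectors
    z = encode(orig)                      # latent embedding of the original text
    z_tok = z @ W                         # project the latent into token space
    masked = orig.copy()
    masked[keyword_mask] = z_tok          # replace keyword tokens with the projection
    z_new = encode(masked)                # latent of the modified text
    return float(((z_new - z) ** 2).sum())  # make both texts share one latent

ids = np.array([3, 17, 42, 8])            # a toy tokenized caption
mask = np.array([False, True, True, False])  # pretend tokens 17 and 42 are keywords
loss = smp_loss(ids, mask)
```

In the actual method only the projection is trained against this objective, so gradient descent on `W` (frozen encoder) would drive the loss toward zero.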

Results

The same ten results are indexed under two task tags, "Image Retrieval" and "Composed Image Retrieval (CoIR)"; they are listed once below.

Dataset    | Metric                  | Value | Model
GeneCIS    | A-R@1                   | 13.7  | LinCIR (CLIP G/14)
GeneCIS    | A-R@1                   | 12.2  | LinCIR (CLIP L/14)
Fashion IQ | (Recall@10+Recall@50)/2 | 55.4  | LinCIR (CLIP G/14)
Fashion IQ | (Recall@10+Recall@50)/2 | 36.39 | LinCIR (CLIP L/14)
ImageNet   | Average Recall          | 21.64 | LinCIR (CLIP L/14)
CIRCO      | mAP@10                  | 21.01 | LinCIR (CLIP G/14)
CIRCO      | mAP@10                  | 13.58 | LinCIR (CLIP L/14)
ImageNet-R | (Recall@10+Recall@50)/2 | 21.64 | LinCIR (CLIP L/14)
CIRR       | R@5                     | 64.72 | LinCIR (CLIP G/14)
CIRR       | R@5                     | 53.25 | LinCIR (CLIP L/14)

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)