Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval

Junyang Chen, Hanjiang Lai

2023-11-13 · Contrastive Learning · Retrieval · Zero-Shot Composed Image Retrieval (ZS-CIR) · Language Modelling · Image Retrieval
Paper · PDF · Code (official)

Abstract

Zero-shot composed image retrieval (ZS-CIR), which takes a textual modification and a reference image as a query to retrieve a target image without triplet labeling, has gained more and more attention in data mining. Current ZS-CIR research mainly relies on the generalization ability of pre-trained vision-language models, e.g., CLIP. However, the pre-trained vision-language models and CIR tasks have substantial discrepancies, where the vision-language models focus on learning the similarities but CIR aims to learn the modifications of the image guided by text. In this paper, we introduce a novel unlabeled and pre-trained masked tuning approach, which reduces the gap between the pre-trained vision-language model and the downstream CIR task. First, to reduce the gap, we reformulate the contrastive learning of the vision-language model as the CIR task, where we randomly mask input image patches to generate $\langle$masked image, text, image$\rangle$ triplet from an image-text pair. Then, we propose a simple but novel pre-trained masked tuning method, which uses the text and the masked image to learn the modifications of the original image. With such a simple design, the proposed masked tuning can learn to better capture fine-grained text-guided modifications. Extensive experimental results demonstrate the significant superiority of our approach over the baseline models on four ZS-CIR datasets, including FashionIQ, CIRR, CIRCO, and GeneCIS. Our codes are available at https://github.com/Chen-Junyang-cn/PLI
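The training recipe described above — mask random patches of an image to form a ⟨masked image, text, image⟩ triplet, then contrastively train the fused (masked image, text) query to retrieve the original image — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, patch size, mask ratio, and the InfoNCE-style loss formulation are our assumptions for exposition.

```python
import numpy as np

def mask_patches(image, patch=4, mask_ratio=0.75, rng=None):
    """Zero out a random fraction of non-overlapping patches.

    image: (H, W, C) array; H and W must be divisible by `patch`.
    Hypothetical helper illustrating the masking step, not the paper's code.
    """
    rng = rng or np.random.default_rng(0)
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    masked = image.copy()
    # Pick which patches to mask, without replacement.
    idx = rng.choice(n_patches, size=int(n_patches * mask_ratio), replace=False)
    for i in idx:
        r, col = divmod(i, gw)
        masked[r * patch:(r + 1) * patch, col * patch:(col + 1) * patch, :] = 0.0
    return masked

def info_nce(queries, targets, tau=0.07):
    """InfoNCE-style contrastive loss: query i should match target i.

    queries: fused (masked image, text) embeddings, shape (B, D).
    targets: original-image embeddings, shape (B, D).
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = q @ t.T / tau                       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # cross-entropy on the diagonal
```

In the paper's setting the query embedding would come from the vision-language model's encoders applied to the masked image and the modification text, and the target from its image encoder applied to the unmasked image; the loss then teaches the model to recover what the mask removed, guided by the text.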

Results

Task | Dataset | Metric | Value | Model
Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 46.42 | MTCIR (CLIP L/14)
Image Retrieval | CIRCO | mAP@10 | 11.63 | MTCIR (CLIP L/14)
Image Retrieval | CIRCO | mAP@10 | 8.03 | MTCIR (BLIP B/16)
Image Retrieval | CIRR | R@5 | 58.87 | MTCIR (BLIP B/16)
Image Retrieval | CIRR | R@5 | 54.58 | MTCIR (CLIP L/14)
Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 46.42 | MTCIR (CLIP L/14)
Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 11.63 | MTCIR (CLIP L/14)
Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 8.03 | MTCIR (BLIP B/16)
Composed Image Retrieval (CoIR) | CIRR | R@5 | 58.87 | MTCIR (BLIP B/16)
Composed Image Retrieval (CoIR) | CIRR | R@5 | 54.58 | MTCIR (CLIP L/14)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)