Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval

Jaeseok Byun, Seokhyeon Jeong, Wonjae Kim, Sanghyuk Chun, Taesup Moon

2024-06-13 · Contrastive Learning · Retrieval · Zero-Shot Composed Image Retrieval (ZS-CIR) · Image Retrieval

Paper · PDF · Code

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable image searches. The mainstream Zero-Shot (ZS) CIR methods bypass the need for expensive CIR training triplets by projecting image embeddings into the text token embedding space, forming a composed query for retrieval. However, we highlight an inherent limitation in these projection-based CIR methods: a task discrepancy of text encoders between the original pre-training task of the encoders (text $\leftrightarrow$ image) and the target CIR task (image + text $\leftrightarrow$ image), which can degrade CIR performance. To reduce such a discrepancy, a naive solution would be to train both image and text encoders with CIR triplets in a supervised manner. Instead, we introduce Reducing Task Discrepancy of Text Encoders (RTD), an efficient text-only post-hoc framework that complements projection-based CIR methods. We devise a novel target-anchored text contrastive learning designed to enhance the capability of the text encoder for CIR. We also propose two key enhancements: (1) a hard negative-based refined batch sampling strategy and (2) a refined concatenation scheme to further mitigate the training-inference discrepancy. Integrating RTD into state-of-the-art projection-based methods achieves performance comparable to, or even surpassing, resource-intensive state-of-the-art synthetic CIR triplet-based approaches with only 23 minutes of additional training on 4 A100 GPUs (up to $100\times$ faster in training). Our code will be available upon acceptance.
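The retrieval setup the abstract describes can be illustrated with a minimal sketch: a composed query (reference image plus modification text, here a random stand-in embedding) is trained to score highest against its matching target image within a batch via an InfoNCE-style contrastive loss. This is a generic illustration only; the projection module, the text encoder, and the specifics of RTD's target-anchored loss, batch sampling, and concatenation scheme are not reproduced, and all names and dimensions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, batch = 8, 4

# Hypothetical stand-ins: in real ZS-CIR these would come from a projection
# module (image -> pseudo text token -> text encoder) and an image encoder.
queries = rng.standard_normal((batch, dim))   # composed-query embeddings
targets = rng.standard_normal((batch, dim))   # target-image embeddings

def info_nce(q, t, temperature=0.07):
    """Contrastive loss: row i of q should match row i of t."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)   # cosine-normalize
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = (q @ t.T) / temperature                   # similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                 # matched pairs on diagonal

# Perfectly aligned pairs yield a near-zero loss; random pairs do not.
print(info_nce(queries, targets), info_nce(targets, targets))
```

The loss drives each composed query toward its target image and away from the other targets in the batch; RTD's contribution is to fine-tune only the text encoder against such target anchors, rather than retraining both encoders on CIR triplets.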

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 56.74 | RTD + LinCIR (CLIP G/14) |
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 40.66 | RTD + LinCIR (CLIP L/14) |
| Image Retrieval | CIRCO | mAP@10 | 22.29 | RTD + LinCIR (CLIP G/14) |
| Image Retrieval | CIRCO | mAP@10 | 18.11 | RTD + LinCIR (CLIP L/14) |
| Image Retrieval | CIRR | R@5 | 67.47 | RTD + LinCIR (CLIP G/14) |
| Image Retrieval | CIRR | R@5 | 56.17 | RTD + LinCIR (CLIP L/14) |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 56.74 | RTD + LinCIR (CLIP G/14) |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 40.66 | RTD + LinCIR (CLIP L/14) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 22.29 | RTD + LinCIR (CLIP G/14) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 18.11 | RTD + LinCIR (CLIP L/14) |
| Composed Image Retrieval (CoIR) | CIRR | R@5 | 67.47 | RTD + LinCIR (CLIP G/14) |
| Composed Image Retrieval (CoIR) | CIRR | R@5 | 56.17 | RTD + LinCIR (CLIP L/14) |

Related Papers

SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)