Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CoLLM: A Large Language Model for Composed Image Retrieval

Chuong Huynh, Jinyu Yang, Ashish Tawari, Mubarak Shah, Son Tran, Raffay Hamid, Trishul Chilimbi, Abhinav Shrivastava

2025-03-25 · CVPR 2025
Tasks: Large Language Model · Retrieval · Zero-Shot Composed Image Retrieval (ZS-CIR) · Language Modelling · Image Retrieval
Links: Paper · PDF · Code (official)

Abstract

Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with up to 15% performance improvement. Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field.
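As background for the results below, a CIR system ranks candidate images by similarity between a fused embedding of the multimodal query (reference image + modification text) and each candidate's image embedding. A minimal sketch with hypothetical pre-computed embeddings — simple additive fusion stands in here for CoLLM's LLM-based joint embedding:

```python
import numpy as np

def normalize(v):
    # L2-normalize along the last axis so dot products become cosine similarities
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def rank_candidates(ref_img_emb, mod_text_emb, candidate_embs):
    """Rank gallery images for a composed query.

    ref_img_emb:    (d,)  embedding of the reference image (hypothetical)
    mod_text_emb:   (d,)  embedding of the modification text (hypothetical)
    candidate_embs: (n, d) embeddings of the candidate gallery images

    Additive fusion is a placeholder; CoLLM instead derives the joint
    embedding from an LLM.
    """
    query = normalize(ref_img_emb + mod_text_emb)
    sims = normalize(candidate_embs) @ query   # cosine similarity per candidate
    return np.argsort(-sims)                   # indices, best match first

# toy usage with random embeddings
rng = np.random.default_rng(0)
ranking = rank_candidates(rng.normal(size=64), rng.normal(size=64),
                          rng.normal(size=(10, 64)))
```

The top-ranked indices would then be scored against the annotated target image with the recall and mAP metrics reported below.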

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 49.9 | CoLLM (finetuned - BLIP-L/16) |
| Image Retrieval | Fashion IQ | R@10 | 39.1 | CoLLM (finetuned - BLIP-L/16) |
| Image Retrieval | Fashion IQ | R@50 | 60.7 | CoLLM (finetuned - BLIP-L/16) |
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 45.3 | CoLLM (Pretrained - BLIP-L/16) |
| Image Retrieval | Fashion IQ | R@10 | 34.6 | CoLLM (Pretrained - BLIP-L/16) |
| Image Retrieval | Fashion IQ | R@50 | 56.0 | CoLLM (Pretrained - BLIP-L/16) |
| Image Retrieval | Fashion IQ | (Recall@10+Recall@50)/2 | 39.8 | CoLLM (Pretrained - CLIP-L/14) |
| Image Retrieval | Fashion IQ | R@10 | 30.1 | CoLLM (Pretrained - CLIP-L/14) |
| Image Retrieval | Fashion IQ | R@50 | 49.5 | CoLLM (Pretrained - CLIP-L/14) |
| Image Retrieval | CIRCO | mAP@5 | 20.3 | CoLLM (Pretrained - CLIP-L/14) |
| Image Retrieval | CIRCO | mAP@10 | 20.8 | CoLLM (Pretrained - CLIP-L/14) |
| Image Retrieval | CIRCO | mAP@50 | 23.4 | CoLLM (Pretrained - CLIP-L/14) |
| Image Retrieval | CIRCO | mAP@5 | 19.7 | CoLLM (Pretrained - BLIP-L/16) |
| Image Retrieval | CIRCO | mAP@10 | 20.4 | CoLLM (Pretrained - BLIP-L/16) |
| Image Retrieval | CIRCO | mAP@50 | 23.1 | CoLLM (Pretrained - BLIP-L/16) |
| Image Retrieval | CIRR | R@1 | 45.8 | CoLLM (finetuned - BLIP-L/16) |
| Image Retrieval | CIRR | R@10 | 84.7 | CoLLM (finetuned - BLIP-L/16) |
| Image Retrieval | CIRR | R@50 | 95.8 | CoLLM (finetuned - BLIP-L/16) |
| Image Retrieval | CIRR | R@1 | 35.0 | CoLLM (Pretrained - BLIP-L/16) |
| Image Retrieval | CIRR | R@10 | 78.6 | CoLLM (Pretrained - BLIP-L/16) |
| Image Retrieval | CIRR | R@50 | 94.2 | CoLLM (Pretrained - BLIP-L/16) |
| Image Retrieval | CIRR | R@1 | 29.7 | CoLLM (Pretrained - CLIP-L/14) |
| Image Retrieval | CIRR | R@10 | 72.8 | CoLLM (Pretrained - CLIP-L/14) |
| Image Retrieval | CIRR | R@50 | 91.5 | CoLLM (Pretrained - CLIP-L/14) |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 49.9 | CoLLM (finetuned - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | Fashion IQ | R@10 | 39.1 | CoLLM (finetuned - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | Fashion IQ | R@50 | 60.7 | CoLLM (finetuned - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 45.3 | CoLLM (Pretrained - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | Fashion IQ | R@10 | 34.6 | CoLLM (Pretrained - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | Fashion IQ | R@50 | 56.0 | CoLLM (Pretrained - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | Fashion IQ | (Recall@10+Recall@50)/2 | 39.8 | CoLLM (Pretrained - CLIP-L/14) |
| Composed Image Retrieval (CoIR) | Fashion IQ | R@10 | 30.1 | CoLLM (Pretrained - CLIP-L/14) |
| Composed Image Retrieval (CoIR) | Fashion IQ | R@50 | 49.5 | CoLLM (Pretrained - CLIP-L/14) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@5 | 20.3 | CoLLM (Pretrained - CLIP-L/14) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 20.8 | CoLLM (Pretrained - CLIP-L/14) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@50 | 23.4 | CoLLM (Pretrained - CLIP-L/14) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@5 | 19.7 | CoLLM (Pretrained - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 20.4 | CoLLM (Pretrained - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | CIRCO | mAP@50 | 23.1 | CoLLM (Pretrained - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | CIRR | R@1 | 45.8 | CoLLM (finetuned - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | CIRR | R@10 | 84.7 | CoLLM (finetuned - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | CIRR | R@50 | 95.8 | CoLLM (finetuned - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | CIRR | R@1 | 35.0 | CoLLM (Pretrained - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | CIRR | R@10 | 78.6 | CoLLM (Pretrained - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | CIRR | R@50 | 94.2 | CoLLM (Pretrained - BLIP-L/16) |
| Composed Image Retrieval (CoIR) | CIRR | R@1 | 29.7 | CoLLM (Pretrained - CLIP-L/14) |
| Composed Image Retrieval (CoIR) | CIRR | R@10 | 72.8 | CoLLM (Pretrained - CLIP-L/14) |
| Composed Image Retrieval (CoIR) | CIRR | R@50 | 91.5 | CoLLM (Pretrained - CLIP-L/14) |
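The results above use Recall@k (R@k) and mAP@k. A minimal sketch of how these retrieval metrics are typically computed, independent of any specific model (CIRR uses a single target per query, so R@k reduces to a hit test; CIRCO has multiple relevant images per query, hence mAP@k):

```python
def recall_at_k(ranked_ids, target_id, k):
    # Single-target recall: 1.0 if the target appears in the top-k, else 0.0.
    return float(target_id in ranked_ids[:k])

def average_precision_at_k(ranked_ids, relevant_ids, k):
    # Sum precision at each rank where a relevant item appears, normalized
    # by min(|relevant|, k); averaging this over queries gives mAP@k.
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant_ids), k) if relevant_ids else 0.0

# e.g. relevant items {1, 3} found at ranks 1 and 3 of [1, 2, 3]:
# precision contributions 1/1 and 2/3, so AP@3 = (1 + 2/3) / 2
```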

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
- GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)