Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning

Pengfei Luo, Jingbo Zhou, Tong Xu, Yuan Xia, Linli Xu, Enhong Chen

2025-03-13 | Retrieval | Zero-Shot Composed Image Retrieval (ZS-CIR) | Image Retrieval

Abstract

With the proliferation of images in online content, language-guided image retrieval (LGIR) has emerged as a research hotspot over the past decade, encompassing a variety of subtasks with diverse input forms. While the development of large multimodal models (LMMs) has significantly facilitated these tasks, existing approaches often address them in isolation, requiring the construction of separate systems for each task. This not only increases system complexity and maintenance costs, but also exacerbates challenges stemming from language ambiguity and complex image content, making it difficult for retrieval systems to provide accurate and reliable results. To this end, we propose ImageScope, a training-free, three-stage framework that leverages collective reasoning to unify LGIR tasks. The key insight behind the unification lies in the compositional nature of language, which transforms diverse LGIR tasks into a generalized text-to-image retrieval process, along with the reasoning of LMMs serving as a universal verification to refine the results. To be specific, in the first stage, we improve the robustness of the framework by synthesizing search intents across varying levels of semantic granularity using chain-of-thought (CoT) reasoning. In the second and third stages, we then reflect on retrieval results by verifying predicate propositions locally, and performing pairwise evaluations globally. Experiments conducted on six LGIR datasets demonstrate that ImageScope outperforms competitive baselines. Comprehensive evaluations and ablation studies further confirm the effectiveness of our design.
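The three stages named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`stage1_retrieve`, `stage2_verify`, `stage3_pairwise`), the `Candidate` type, the averaging of similarities over several rewritten queries, and the "move verified candidates to the front" and "rank by pairwise wins" rules are all assumptions made for illustration. In a real system the embeddings would come from a CLIP-style encoder and the `check` and `prefer` callbacks would be LMM calls; here they are simple stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    image_id: str
    embedding: Tuple[float, ...]  # stand-in for an image embedding (hypothetical)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


def stage1_retrieve(query_embeddings, gallery, top_k):
    """Text-to-image retrieval with several rewritten queries (assumed:
    one embedding per semantic granularity, scores averaged)."""
    def score(c: Candidate) -> float:
        sims = [cosine(q, c.embedding) for q in query_embeddings]
        return sum(sims) / len(sims)
    return sorted(gallery, key=score, reverse=True)[:top_k]


def stage2_verify(candidates, propositions, check: Callable):
    """Local verification: candidates satisfying every proposition are
    moved ahead of the rest; relative order is otherwise preserved."""
    flags = [all(check(c, p) for p in propositions) for c in candidates]
    passed = [c for c, ok in zip(candidates, flags) if ok]
    failed = [c for c, ok in zip(candidates, flags) if not ok]
    return passed + failed


def stage3_pairwise(candidates, prefer: Callable, top_n):
    """Global evaluation: re-rank the first top_n candidates by the number
    of pairwise comparisons each one wins."""
    head, tail = list(candidates[:top_n]), list(candidates[top_n:])
    wins = {c.image_id: sum(prefer(c, o) for o in head if o is not c) for c in head}
    head.sort(key=lambda c: wins[c.image_id], reverse=True)  # stable for ties
    return head + tail


# Toy data: 2-D "embeddings" and hand-written facts replace real model outputs.
gallery = [
    Candidate("a", (1.0, 0.0)),
    Candidate("b", (0.9, 0.1)),
    Candidate("c", (0.0, 1.0)),
    Candidate("d", (0.7, 0.7)),
]
queries = [(1.0, 0.0), (1.0, 0.2)]
facts = {"a": {"red"}, "b": {"red", "long"}, "c": set(), "d": {"red", "long"}}

shortlist = stage1_retrieve(queries, gallery, top_k=3)
verified = stage2_verify(shortlist, ["red", "long"], lambda c, p: p in facts[c.image_id])
final = stage3_pairwise(verified, lambda x, y: x.image_id == "d", top_n=2)
print([c.image_id for c in final])
```

Keeping each stage as a pure function over a ranked list makes it easy to swap the stand-in callbacks for model calls without changing the control flow.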

Results

Task | Dataset | Metric | Value | Model
Image Retrieval | Fashion IQ | R@10 | 31.36 | ImageScope (CLIP-ViT-L/14)
Image Retrieval | Fashion IQ | R@50 | 50.78 | ImageScope (CLIP-ViT-L/14)
Image Retrieval | CIRCO | mAP@5 | 28.36 | ImageScope (CLIP-ViT-L/14)
Image Retrieval | CIRCO | mAP@10 | 29.23 | ImageScope (CLIP-ViT-L/14)
Image Retrieval | CIRCO | mAP@25 | 30.81 | ImageScope (CLIP-ViT-L/14)
Image Retrieval | CIRCO | mAP@50 | 31.88 | ImageScope (CLIP-ViT-L/14)
Image Retrieval | CIRR | R@1 | 39.37 | ImageScope (CLIP-ViT-L/14)
Image Retrieval | CIRR | R@5 | 67.54 | ImageScope (CLIP-ViT-L/14)
Image Retrieval | CIRR | R@10 | 78.05 | ImageScope (CLIP-ViT-L/14)
Image Retrieval | CIRR | R@50 | 92.94 | ImageScope (CLIP-ViT-L/14)
Image Retrieval | VisDial | Hits@10 on 10 Round | 79.89 | ImageScope (CLIP-ViT-L/14)
Composed Image Retrieval (CoIR) | Fashion IQ | R@10 | 31.36 | ImageScope (CLIP-ViT-L/14)
Composed Image Retrieval (CoIR) | Fashion IQ | R@50 | 50.78 | ImageScope (CLIP-ViT-L/14)
Composed Image Retrieval (CoIR) | CIRCO | mAP@5 | 28.36 | ImageScope (CLIP-ViT-L/14)
Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 29.23 | ImageScope (CLIP-ViT-L/14)
Composed Image Retrieval (CoIR) | CIRCO | mAP@25 | 30.81 | ImageScope (CLIP-ViT-L/14)
Composed Image Retrieval (CoIR) | CIRCO | mAP@50 | 31.88 | ImageScope (CLIP-ViT-L/14)
Composed Image Retrieval (CoIR) | CIRR | R@1 | 39.37 | ImageScope (CLIP-ViT-L/14)
Composed Image Retrieval (CoIR) | CIRR | R@5 | 67.54 | ImageScope (CLIP-ViT-L/14)
Composed Image Retrieval (CoIR) | CIRR | R@10 | 78.05 | ImageScope (CLIP-ViT-L/14)
Composed Image Retrieval (CoIR) | CIRR | R@50 | 92.94 | ImageScope (CLIP-ViT-L/14)
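For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute Recall@K and mAP@K from a ranked list, reported as percentages. It assumes R@K counts a query as a hit when any ground-truth image appears in the top K, and that AP@K is normalized by min(K, number of ground truths). The exact evaluation protocol of each benchmark may differ, and the function names are hypothetical.

```python
from typing import Iterable, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Sum of precision@i over the ranks i <= k that hold a relevant item,
    divided by min(k, number of relevant items) (assumed normalization)."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / i  # precision at rank i
    return total / min(k, len(relevant))


def mean_over_queries(values: Iterable[float]) -> float:
    """Average per-query scores and express the result as a percentage."""
    values = list(values)
    return 100.0 * sum(values) / len(values)


# Toy example: one query with two ground-truth images, "t" and "u".
ranked = ["x", "t", "y", "u"]
relevant = {"t", "u"}
print(recall_at_k(ranked, relevant, 1))             # 0.0
print(recall_at_k(ranked, relevant, 2))             # 1.0
print(average_precision_at_k(ranked, relevant, 4))  # 0.5
```

In the toy example the hits fall at ranks 2 and 4, so AP@4 is (1/2 + 2/4) / 2 = 0.5.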

Related Papers

2025-07-17 | From Roots to Rewards: Dynamic Tree Reasoning with RL
2025-07-17 | HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals
2025-07-17 | A Survey of Context Engineering for Large Language Models
2025-07-17 | MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval
2025-07-17 | FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval
2025-07-16 | Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker
2025-07-16 | Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos
2025-07-16 | Context-Aware Search and Retrieval Over Erasure Channels