Data Roaming and Quality Assessment for Composed Image Retrieval

Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski

2023-03-16 | Composed Image Retrieval (CoIR) | Retrieval | Image Retrieval

Abstract

The task of Composed Image Retrieval (CoIR) involves queries that combine image and text modalities, allowing users to express their intent more effectively. However, current CoIR datasets are orders of magnitude smaller than other vision and language (V&L) datasets. Additionally, some of these datasets have noticeable issues, such as queries containing redundant modalities. To address these shortcomings, we introduce the Large Scale Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset that is ten times larger than existing ones. Pre-training on LaSCo yields a noteworthy improvement in performance, even in the zero-shot setting. Furthermore, we propose a new approach for analyzing CoIR datasets and methods, which detects modality redundancy or necessity in queries. We also introduce a new CoIR baseline, the Cross-Attention driven Shift Encoder (CASE). This baseline allows for early fusion of modalities using a cross-attention module and employs an additional auxiliary task during training. Our experiments demonstrate that this new baseline outperforms the current state-of-the-art methods on established benchmarks such as FashionIQ and CIRR.

Results

Task	Dataset	Metric	Value	Model
Image Retrieval	LaSCo	Recall@1 (%)	7.08	CASE
Image Retrieval	LaSCo	Recall@1 (%)	4.26	BLIP4CIR
Image Retrieval	Fashion IQ	(Recall@10+Recall@50)/2	59.73	CASE
Image Retrieval	Fashion IQ	Recall@10	48.79	CASE
Image Retrieval	CIRR	(Recall@5+Recall_subset@1)/2	78.25	CASE (Pre-trained on LaSCo.Ca)
Image Retrieval	CIRR	Recall@10	88.75	CASE (Pre-trained on LaSCo.Ca)
Image Retrieval	CIRR	(Recall@5+Recall_subset@1)/2	77.5	CASE
Image Retrieval	CIRR	Recall@10	87.25	CASE
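
The Recall@K and averaged-recall metrics in the table can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `recall_at_k` and `mean_recall` and the toy gallery IDs are hypothetical, and each query is assumed to have exactly one correct target image. The subset variant reported for CIRR would presumably apply the same computation to rankings restricted to each query's candidate subset, which is not shown here.

```python
from typing import Hashable, Sequence


def recall_at_k(
    rankings: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
    k: int,
) -> float:
    """Percentage of queries whose target is among the top-k ranked items.

    rankings[i] is the gallery ordered from best to worst match for query i;
    targets[i] is the single correct gallery ID for query i (an assumption).
    """
    if len(rankings) != len(targets) or not rankings:
        raise ValueError("need one non-empty ranking per target")
    hits = sum(
        1 for ranked, target in zip(rankings, targets) if target in ranked[:k]
    )
    return 100.0 * hits / len(rankings)


def mean_recall(
    rankings: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
    ks: Sequence[int],
) -> float:
    """Average of Recall@K over several cutoffs, e.g. ks=(10, 50)."""
    return sum(recall_at_k(rankings, targets, k) for k in ks) / len(ks)


if __name__ == "__main__":
    # Toy data (hypothetical): four queries over a four-image gallery.
    toy_rankings = [
        ["a", "b", "c", "d"],
        ["b", "a", "c", "d"],
        ["d", "c", "b", "a"],
        ["c", "d", "a", "b"],
    ]
    toy_targets = ["a", "a", "a", "a"]
    print(recall_at_k(toy_rankings, toy_targets, 1))
    print(mean_recall(toy_rankings, toy_targets, (1, 3)))
```

With real retrieval output, a call such as `mean_recall(rankings, targets, (10, 50))` would correspond to the (Recall@10+Recall@50)/2 row, under the single-target assumption stated above.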
