Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong

2024-12-19 · Retrieval · Zero-Shot Composed Image Retrieval (ZS-CIR) · Image Retrieval
Paper · PDF · Code

Abstract

Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated from this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70$\times$ more data from existing datasets. Moreover, since MegaPairs relies solely on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. At the current stage, we have produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our produced dataset, well-trained models, and data synthesis pipeline will be made publicly available to facilitate the future development of this field.
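The abstract describes the pipeline only at a high level: pair up open-domain images, then have a VLM write the text that relates one image to the other, yielding (query image, instruction, target image) training triplets. The following is a minimal sketch of that loop under those assumptions; every function name here (`find_related_images`, `vlm_describe_transition`) is a hypothetical placeholder, not the authors' actual API.

```python
# Hedged sketch of a MegaPairs-style synthesis loop, based only on the
# abstract. The two helpers below are stubs standing in for real components.

def find_related_images(corpus, image, k=2):
    # Placeholder: a real pipeline would use visual-similarity search
    # (e.g. embedding nearest neighbors) over an open image corpus.
    return [img for img in corpus if img != image][:k]

def vlm_describe_transition(src, tgt):
    # Placeholder for prompting an open-source VLM, e.g.
    # "Describe the edit that turns the first image into the second."
    return f"change {src} into {tgt}"

def synthesize_triplets(corpus):
    """Yield (query_image, instruction, target_image) training triplets."""
    for src in corpus:
        for tgt in find_related_images(corpus, src):
            yield (src, vlm_describe_transition(src, tgt), tgt)

triplets = list(synthesize_triplets(["imgA", "imgB", "imgC"]))
```

Because the loop is quadratic in pairings but each triplet is produced independently, scaling up is just a matter of feeding in a larger corpus, which matches the abstract's claim that the method "can be easily scaled up."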

Results

Task | Dataset | Metric | Value | Model
Image Retrieval | Fashion IQ | (Recall@10 + Recall@50) / 2 | 46.1 | MMRet-MLLM
Image Retrieval | Fashion IQ | Recall@10 | 35.6 | MMRet-MLLM
Image Retrieval | CIRR | (Recall@5 + Recall_subset@1) / 2 | 75.7 | MMRet-MLLM
Image Retrieval | CIRR | Recall@10 | 85.1 | MMRet-MLLM
Image Retrieval | CIRCO | mAP@10 | 43.4 | MMRet-MLLM
Image Retrieval | CIRCO | mAP@10 | 40.2 | MMRet-Large (CLIP L/14)
Image Retrieval | CIRCO | mAP@10 | 35.0 | MMRet-Base (CLIP B/16)
Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 43.4 | MMRet-MLLM
Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 40.2 | MMRet-Large (CLIP L/14)
Composed Image Retrieval (CoIR) | CIRCO | mAP@10 | 35.0 | MMRet-Base (CLIP B/16)
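The metrics in the table are the standard CIR evaluation measures: Recall@K (did any ground-truth target appear in the top K?) and mAP@K (rank-weighted precision over possibly many relevant targets, as used by CIRCO). A minimal toy implementation, not tied to any benchmark's official evaluation code:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(
        1 for ranks, rel in zip(ranked_ids, relevant_ids)
        if set(ranks[:k]) & set(rel)
    )
    return hits / len(ranked_ids)

def map_at_k(ranked_ids, relevant_ids, k):
    """Mean average precision at cutoff k (multiple relevant items allowed)."""
    ap_sum = 0.0
    for ranks, rel in zip(ranked_ids, relevant_ids):
        rel = set(rel)
        hits, precision_sum = 0, 0.0
        for i, doc in enumerate(ranks[:k], start=1):
            if doc in rel:
                hits += 1
                precision_sum += hits / i  # precision at each hit position
        ap_sum += precision_sum / min(len(rel), k) if rel else 0.0
    return ap_sum / len(ranked_ids)

# Two toy queries: ranked candidate ids and their ground-truth relevant ids.
ranked = [[3, 1, 2, 9], [5, 4, 7, 8]]
relevant = [[1], [7, 8]]
print(recall_at_k(ranked, relevant, 2))  # -> 0.5 (only query 1 hits in top 2)
print(map_at_k(ranked, relevant, 4))
```

Composite metrics such as Fashion IQ's (Recall@10 + Recall@50) / 2 are then just the arithmetic mean of two `recall_at_k` calls at different cutoffs.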

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)