How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval

Sheng-Chieh Lin, Akari Asai, Minghan Li, Barlas Oguz, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, Xilun Chen

2023-02-15Sentence Retrieval Data Augmentation Passage Retrieval Contrastive Learning Retrieval

Abstract

Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue was due to the limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine the contrastive learning of DRs, under the framework of Data Augmentation (DA). Our study shows that common DA practices such as query augmentation with generative models and pseudo-relevance label creation using a cross-encoder, are often inefficient and sub-optimal. We hence propose a new DA approach with diverse queries and sources of supervision to progressively train a generalizable DR. As a result, DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations and even competes with models using more complex late interaction (ColBERTv2 and SPLADE++).

Results

Task	Dataset	Metric	Value	Model
Information Retrieval	PeerQA	MRR	0.4845	Dragon+
Information Retrieval	PeerQA	Recall@10	0.6817	Dragon+

Related Papers

Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management2025-07-17 Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images2025-07-17 SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts2025-07-17 HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals2025-07-17 SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation2025-07-17 From Roots to Rewards: Dynamic Tree Reasoning with RL2025-07-17 A Survey of Context Engineering for Large Language Models2025-07-17 MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval2025-07-17