Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Retrieval-Augmented Multimodal Language Modeling

Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih

2022-11-22
Tasks: Text-to-Image Generation, Text Generation, Caption Generation, Image to text, Image Captioning, Retrieval, Image Generation, Language Modelling

Abstract

Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities, such as faithful image generation and multimodal in-context learning (e.g., image generation from demonstrations).
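
The retrieve-then-generate flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the paper uses a pretrained CLIP retriever and a CM3 Transformer generator, whereas the code below uses hand-written vectors as stand-ins for embeddings, and the names `MemoryItem`, `retrieve`, and `build_generator_input`, as well as the `<img:...>` prompt format, are hypothetical. It only shows the general idea of ranking external memory by similarity to the query and prepending the top results to the generator's input.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryItem:
    """One multimodal document in external memory (hypothetical structure)."""
    caption: str             # text part of the document
    image_id: str            # placeholder standing in for the image part
    embedding: List[float]   # stand-in for a retriever embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: List[float], memory: List[MemoryItem], k: int = 2) -> List[MemoryItem]:
    """Return the k memory items most similar to the query embedding."""
    ranked = sorted(memory, key=lambda m: cosine(query, m.embedding), reverse=True)
    return ranked[:k]


def build_generator_input(prompt: str, retrieved: List[MemoryItem]) -> str:
    """Prepend retrieved documents to the prompt (assumed, simplified format)."""
    context = [f"<img:{m.image_id}> {m.caption}" for m in retrieved]
    return "\n".join(context + [prompt])


# Toy memory with made-up 3-dimensional embeddings.
memory = [
    MemoryItem("Eiffel Tower at night", "img_001", [0.9, 0.1, 0.0]),
    MemoryItem("A bowl of ramen", "img_002", [0.0, 0.2, 0.9]),
    MemoryItem("Paris skyline", "img_003", [0.8, 0.3, 0.1]),
]

query_embedding = [1.0, 0.0, 0.0]  # made-up embedding of the prompt
top = retrieve(query_embedding, memory, k=2)
generator_input = build_generator_input("A photo of the Eiffel Tower", top)
print(generator_input)
```

In a real system the generator would condition on the retrieved images and text rather than on a plain string, and the memory would be far too large to rank by exhaustive sorting.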

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | COCO (Common Objects in Context) | FID | 12.63 | Stable Diffusion |
| Image Generation | COCO (Common Objects in Context) | FID | 15.7 | RA-CM3 (2.7B) |
| Image Generation | COCO (Common Objects in Context) | FID | 28 | DALL-E (12B) |
| Image Generation | COCO (Common Objects in Context) | FID | 29.5 | Vanilla CM3 |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 103 | Flamingo (80B; 4-shot) |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 89.1 | RA-CM3 (2.7B) |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 85 | Flamingo (3B; 4-shot) |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 83.9 | Parti |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 71.9 | Vanilla CM3 |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 55.8 | X-LXMERT |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 48 | minDALL-E |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 38.7 | ruDALL-E-XL |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 20.2 | DALL-E |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 12.63 | Stable Diffusion |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 15.7 | RA-CM3 (2.7B) |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 28 | DALL-E (12B) |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 29.5 | Vanilla CM3 |
| 10-shot image generation | COCO (Common Objects in Context) | FID | 12.63 | Stable Diffusion |
| 10-shot image generation | COCO (Common Objects in Context) | FID | 15.7 | RA-CM3 (2.7B) |
| 10-shot image generation | COCO (Common Objects in Context) | FID | 28 | DALL-E (12B) |
| 10-shot image generation | COCO (Common Objects in Context) | FID | 29.5 | Vanilla CM3 |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 12.63 | Stable Diffusion |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 15.7 | RA-CM3 (2.7B) |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 28 | DALL-E (12B) |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 29.5 | Vanilla CM3 |
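
As a quick sanity check, the improvements quoted in the abstract ("12 FID and 17 CIDEr") can be compared against the results listed above. The snippet below simply subtracts the listed values; which baseline the abstract's FID figure refers to is an assumption here, so both DALL-E (12B) and Vanilla CM3 are shown. Lower FID is better, and higher CIDEr is better.

```python
# Values copied from the results listed above (MS-COCO).
fid = {"RA-CM3 (2.7B)": 15.7, "DALL-E (12B)": 28.0, "Vanilla CM3": 29.5}
cider = {"RA-CM3 (2.7B)": 89.1, "Vanilla CM3": 71.9}

# FID: lower is better, so the gain is baseline minus RA-CM3.
fid_gain_vs_dalle = round(fid["DALL-E (12B)"] - fid["RA-CM3 (2.7B)"], 1)
fid_gain_vs_cm3 = round(fid["Vanilla CM3"] - fid["RA-CM3 (2.7B)"], 1)

# CIDEr: higher is better, so the gain is RA-CM3 minus baseline.
cider_gain_vs_cm3 = round(cider["RA-CM3 (2.7B)"] - cider["Vanilla CM3"], 1)

print("FID gain vs DALL-E (12B):", fid_gain_vs_dalle)
print("FID gain vs Vanilla CM3:", fid_gain_vs_cm3)
print("CIDEr gain vs Vanilla CM3:", cider_gain_vs_cm3)
```

Under this reading, the listed numbers are roughly consistent with the abstract's rounded figures.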