Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih
Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities, such as faithful image generation and multimodal in-context learning (e.g., image generation from demonstrations).
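The retrieve-then-generate loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy vectors stand in for embeddings that, in RA-CM3, would come from a pretrained CLIP encoder applied to multimodal documents in the external memory.

```python
import numpy as np

def retrieve(query_emb, memory_embs, k=2):
    """Return indices of the top-k memory items by cosine similarity.

    In RA-CM3 the query and memory embeddings would be produced by a
    pretrained CLIP encoder; here they are placeholder vectors.
    """
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity to the query
    return np.argsort(-scores)[:k]     # indices of the k best matches

# Toy external memory of 4 "documents" in a 3-d embedding space.
memory = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

top = retrieve(query, memory, k=2)
# The retrieved documents would then be prepended to the generator's
# input sequence, so the CM3 Transformer can condition on them.
print(top.tolist())  # → [0, 2]
```

The design point this illustrates is the modularity claimed in the abstract: the memory can grow or change without retraining the generator, since knowledge enters only through the retrieved context.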
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | COCO (Common Objects in Context) | FID | 12.63 | Stable Diffusion |
| Image Generation | COCO (Common Objects in Context) | FID | 15.7 | RA-CM3 (2.7B) |
| Image Generation | COCO (Common Objects in Context) | FID | 28 | DALL-E (12B) |
| Image Generation | COCO (Common Objects in Context) | FID | 29.5 | Vanilla CM3 |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 103 | Flamingo (80B; 4-shot) |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 89.1 | RA-CM3 (2.7B) |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 85 | Flamingo (3B; 4-shot) |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 83.9 | Parti |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 71.9 | Vanilla CM3 |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 55.8 | X-LXMERT |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 48 | minDALL-E |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 38.7 | ruDALL-E-XL |
| Image Captioning | COCO (Common Objects in Context) | CIDEr | 20.2 | DALL-E |