


Emu: Generative Pretraining in Multimodality

Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang

2023-07-11 · Question Answering · Text-to-Image Generation · Text Generation · Temporal/Causal QA · Image-to-Text · Video Question Answering · Image Captioning · Image Generation · Visual Question Answering (VQA)

Paper · PDF · Code · Code (official)

Abstract

We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
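To make the unified objective concrete, here is a minimal PyTorch-style sketch of a loss that classifies the next text token at text positions and regresses the next visual embedding at visual positions, as the abstract describes. The class name, tensor shapes, equal loss weighting, and the choice of an l2 regression loss are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class UnifiedAutoregressiveLoss(nn.Module):
    """Sketch of Emu-style training: classify the next text token,
    regress the next visual embedding (details are assumptions)."""

    def __init__(self, hidden_dim: int, vocab_size: int, visual_dim: int):
        super().__init__()
        self.text_head = nn.Linear(hidden_dim, vocab_size)    # next-token classifier
        self.visual_head = nn.Linear(hidden_dim, visual_dim)  # next-embedding regressor
        self.ce = nn.CrossEntropyLoss()
        self.mse = nn.MSELoss()  # assumed l2-style regression loss

    def forward(self, hidden, text_targets, visual_targets, is_text):
        # hidden:         (seq, hidden_dim) transformer outputs per position
        # text_targets:   (seq,) ids of the next text token (used where is_text)
        # visual_targets: (seq, visual_dim) next visual embeddings (used elsewhere)
        # is_text:        (seq,) bool mask, True where the next element is text
        loss = hidden.new_zeros(())
        if is_text.any():
            loss = loss + self.ce(self.text_head(hidden[is_text]),
                                  text_targets[is_text])
        if (~is_text).any():
            loss = loss + self.mse(self.visual_head(hidden[~is_text]),
                                   visual_targets[~is_text])
        return loss


if __name__ == "__main__":
    seq, hid, vocab, vis = 16, 512, 32000, 1024  # hypothetical sizes
    loss_fn = UnifiedAutoregressiveLoss(hid, vocab, vis)
    hidden = torch.randn(seq, hid)
    text_targets = torch.randint(0, vocab, (seq,))
    visual_targets = torch.randn(seq, vis)
    is_text = torch.rand(seq) < 0.7  # interleaved text/visual positions
    print(loss_fn(hidden, text_targets, visual_targets, is_text).item())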

Results

Task                            | Dataset     | Metric        | Value | Model
Question Answering              | NExT-QA     | WUPS          | 23.4  | Emu (0-shot)
Visual Question Answering (VQA) | InfiMM-Eval | Abductive     | 36.57 | Emu
Visual Question Answering (VQA) | InfiMM-Eval | Analogical    | 18.19 | Emu
Visual Question Answering (VQA) | InfiMM-Eval | Deductive     | 28.9  | Emu
Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 28.24 | Emu
Visual Question Answering (VQA) | VQA v2      | Accuracy      | 57.5  | Emu-I *
Visual Question Answering (VQA) | VizWiz      | Accuracy      | 38.1  | Emu-I *

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)