Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, BoWen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang

2024-09-27

Text-to-Image Generation, Prediction, All, Image Generation, Visual Question Answering

Abstract

While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.
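The abstract's core idea, mapping text and visual content into one discrete token stream and training with a single next-token objective, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the vocabulary sizes, the `BOI`/`EOI` boundary tokens, the id-offset layout, and the `uniform_model` stand-in are hypothetical and are not taken from Emu3's actual tokenizer or architecture.

```python
import math

# Hypothetical vocabulary layout (an assumption, not from the paper):
# text ids first, then visual codebook ids shifted by an offset,
# then two special boundary tokens.
TEXT_VOCAB = 1000
VISUAL_CODEBOOK = 512
BOI = TEXT_VOCAB + VISUAL_CODEBOOK  # hypothetical begin-of-image marker
EOI = BOI + 1                       # hypothetical end-of-image marker
VOCAB_SIZE = EOI + 1


def build_sequence(text_ids, image_codes):
    """Flatten text ids and discrete image codes into one token stream."""
    visual_ids = [TEXT_VOCAB + c for c in image_codes]
    return list(text_ids) + [BOI] + visual_ids + [EOI]


def next_token_pairs(sequence):
    """Return (context, target) pairs: each prefix predicts the next token."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def sequence_loss(sequence, model):
    """Average next-token loss; `model` maps a context to a list of logits."""
    pairs = next_token_pairs(sequence)
    return sum(cross_entropy(model(ctx), tgt) for ctx, tgt in pairs) / len(pairs)


def uniform_model(context):
    """Stand-in for a transformer: equal logits for every token."""
    return [0.0] * VOCAB_SIZE


seq = build_sequence(text_ids=[5, 17, 42], image_codes=[3, 511, 0, 7])
loss = sequence_loss(seq, uniform_model)
```

Because text and image tokens share one id space, the same loss applies to every position regardless of modality. With the uniform stand-in model the loss equals `log(VOCAB_SIZE)`, which serves as a sanity baseline that a trained model would be expected to beat.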

Results

Task	Dataset	Metric	Value	Model
Image Generation	WISE	Biology	0.41	Emu3-gen
Image Generation	WISE	Chemistry	0.27	Emu3-gen
Image Generation	WISE	Cultural	0.34	Emu3-gen
Image Generation	WISE	Overall	0.39	Emu3-gen
Image Generation	WISE	Physics	0.45	Emu3-gen
Image Generation	WISE	Space	0.48	Emu3-gen
Image Generation	WISE	Time	0.45	Emu3-gen
Image Generation	T2I-CompBench	Color	0.7913	Emu3
Image Generation	T2I-CompBench	Shape	0.5846	Emu3
Image Generation	T2I-CompBench	Texture	0.7422	Emu3
Image Generation	GenEval	Overall	0.66	Emu3
Visual Question Answering (VQA)	MM-Vet	GPT-4 score	37.2	Emu3
Text-to-Image Generation	T2I-CompBench	Color	0.7913	Emu3
Text-to-Image Generation	T2I-CompBench	Shape	0.5846	Emu3
Text-to-Image Generation	T2I-CompBench	Texture	0.7422	Emu3
Text-to-Image Generation	GenEval	Overall	0.66	Emu3
