Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, BoWen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang

Date: 2024-09-27
Tasks: Text-to-Image Generation, Prediction, All, Image Generation, Visual Question Answering

Abstract

While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.
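To make the abstract's central idea concrete, here is a minimal sketch of next-token prediction over a single sequence that mixes text tokens and discrete image tokens. Everything in it is an illustrative assumption rather than Emu3's actual design: the vocabulary sizes, the BOI/EOI boundary markers, the helper names, and in particular the bigram count model, which stands in for the transformer only so the example stays small and runnable.

```python
import math
from collections import Counter, defaultdict

# Hypothetical unified vocabulary: text ids first, then visual codebook ids.
TEXT_VOCAB = 8        # toy size, not Emu3's
VISION_VOCAB = 16     # toy size, not Emu3's
BOI = TEXT_VOCAB + VISION_VOCAB   # assumed "begin of image" marker
EOI = BOI + 1                     # assumed "end of image" marker
VOCAB_SIZE = TEXT_VOCAB + VISION_VOCAB + 2


def to_unified(text_ids, vision_codes):
    """Flatten a caption and an image's discrete codes into one token sequence."""
    shifted = [TEXT_VOCAB + c for c in vision_codes]  # move codes past the text range
    return list(text_ids) + [BOI] + shifted + [EOI]


def next_token_pairs(seq):
    """(previous token, target token) pairs: the next-token training signal."""
    return list(zip(seq[:-1], seq[1:]))


def fit_bigram(seqs):
    """Count-based bigram model, a stand-in for a transformer."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for prev, nxt in next_token_pairs(seq):
            counts[prev][nxt] += 1
    return counts


def nll(counts, seq, alpha=1.0):
    """Average next-token negative log-likelihood with add-alpha smoothing."""
    pairs = next_token_pairs(seq)
    total = 0.0
    for prev, nxt in pairs:
        row = counts[prev]
        p = (row[nxt] + alpha) / (sum(row.values()) + alpha * VOCAB_SIZE)
        total -= math.log(p)
    return total / len(pairs)


seq = to_unified(text_ids=[1, 2, 3], vision_codes=[0, 5, 15])
untrained = nll(defaultdict(Counter), seq)   # uniform predictions
trained = nll(fit_bigram([seq]), seq)        # after seeing the sequence once
print(seq, round(untrained, 3), round(trained, 3))
```

The point of the sketch is that the objective is identical for text and image positions: once both modalities share one token space, the same loss applies across the whole sequence, with no modality-specific branch.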

Results

Task                             Dataset        Metric       Value   Model
Image Generation                 WISE           Biology      0.41    Emu3-gen
Image Generation                 WISE           Chemistry    0.27    Emu3-gen
Image Generation                 WISE           Cultural     0.34    Emu3-gen
Image Generation                 WISE           Overall      0.39    Emu3-gen
Image Generation                 WISE           Physics      0.45    Emu3-gen
Image Generation                 WISE           Space        0.48    Emu3-gen
Image Generation                 WISE           Time         0.45    Emu3-gen
Image Generation                 T2I-CompBench  Color        0.7913  Emu3
Image Generation                 T2I-CompBench  Shape        0.5846  Emu3
Image Generation                 T2I-CompBench  Texture      0.7422  Emu3
Image Generation                 GenEval        Overall      0.66    Emu3
Visual Question Answering (VQA)  MM-Vet         GPT-4 score  37.2    Emu3
Text-to-Image Generation         T2I-CompBench  Color        0.7913  Emu3
Text-to-Image Generation         T2I-CompBench  Shape        0.5846  Emu3
Text-to-Image Generation         T2I-CompBench  Texture      0.7422  Emu3
Text-to-Image Generation         GenEval        Overall      0.66    Emu3
10-shot image generation         T2I-CompBench  Color        0.7913  Emu3
10-shot image generation         T2I-CompBench  Shape        0.5846  Emu3
10-shot image generation         T2I-CompBench  Texture      0.7422  Emu3
10-shot image generation         GenEval        Overall      0.66    Emu3
Visual Question Answering        MM-Vet         GPT-4 score  37.2    Emu3
1 Image, 2*2 Stitchi             T2I-CompBench  Color        0.7913  Emu3
1 Image, 2*2 Stitchi             T2I-CompBench  Shape        0.5846  Emu3
1 Image, 2*2 Stitchi             T2I-CompBench  Texture      0.7422  Emu3
1 Image, 2*2 Stitchi             GenEval        Overall      0.66    Emu3

Related Papers

Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction (2025-07-21)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)