Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan

2025-05-20
Multimodal Generation, Multimodal Reasoning, Image Editing, Image Generation, Image Manipulation

Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

Results

Task	Dataset	Metric	Value	Model
Image Generation	WISE	Biology	0.65	BAGEL (w/ CoT)
Image Generation	WISE	Chemistry	0.58	BAGEL (w/ CoT)
Image Generation	WISE	Cultural	0.76	BAGEL (w/ CoT)
Image Generation	WISE	Overall	0.7	BAGEL (w/ CoT)
Image Generation	WISE	Physics	0.75	BAGEL (w/ CoT)
Image Generation	WISE	Space	0.75	BAGEL (w/ CoT)
Image Generation	WISE	Time	0.69	BAGEL (w/ CoT)
Image Generation	WISE	Biology	0.44	BAGEL
Image Generation	WISE	Chemistry	0.39	BAGEL
Image Generation	WISE	Cultural	0.44	BAGEL
Image Generation	WISE	Overall	0.52	BAGEL
Image Generation	WISE	Physics	0.6	BAGEL
Image Generation	WISE	Space	0.68	BAGEL
Image Generation	WISE	Time	0.55	BAGEL
Image Editing	ImgEdit-Data	Action	4.17	BAGEL
Image Editing	ImgEdit-Data	Add	3.56	BAGEL
Image Editing	ImgEdit-Data	Adjust	3.31	BAGEL
Image Editing	ImgEdit-Data	Background	3.24	BAGEL
Image Editing	ImgEdit-Data	Extract	1.7	BAGEL
Image Editing	ImgEdit-Data	Hybrid	2.38	BAGEL
Image Editing	ImgEdit-Data	Overall	3.2	BAGEL
Image Editing	ImgEdit-Data	Remove	2.62	BAGEL
Image Editing	ImgEdit-Data	Replace	3.3	BAGEL
Image Editing	ImgEdit-Data	Style	4.49	BAGEL
Image Editing	GEdit-Bench-EN	Overall	6.52	BAGEL
Image Editing	GEdit-Bench-EN	Perceptual Quality	6.83	BAGEL
Image Editing	GEdit-Bench-EN	Semantic Consistency	7.36	BAGEL