Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan

2025-05-20
Multimodal Generation, Multimodal Reasoning, Image Editing, Image Generation, Image Manipulation

Abstract

Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

Results

Task              Dataset         Metric                Value  Model
Image Generation  WISE            Biology               0.65   Bagel (w/ cot)
Image Generation  WISE            Chemistry             0.58   Bagel (w/ cot)
Image Generation  WISE            Cultural              0.76   Bagel (w/ cot)
Image Generation  WISE            Overall               0.7    Bagel (w/ cot)
Image Generation  WISE            Physics               0.75   Bagel (w/ cot)
Image Generation  WISE            Space                 0.75   Bagel (w/ cot)
Image Generation  WISE            Time                  0.69   Bagel (w/ cot)
Image Generation  WISE            Biology               0.44   Bagel
Image Generation  WISE            Chemistry             0.39   Bagel
Image Generation  WISE            Cultural              0.44   Bagel
Image Generation  WISE            Overall               0.52   Bagel
Image Generation  WISE            Physics               0.6    Bagel
Image Generation  WISE            Space                 0.68   Bagel
Image Generation  WISE            Time                  0.55   Bagel
Image Editing     ImgEdit-Data    Action                4.17   BAGEL
Image Editing     ImgEdit-Data    Add                   3.56   BAGEL
Image Editing     ImgEdit-Data    Adjust                3.31   BAGEL
Image Editing     ImgEdit-Data    Background            3.24   BAGEL
Image Editing     ImgEdit-Data    Extract               1.7    BAGEL
Image Editing     ImgEdit-Data    Hybrid                2.38   BAGEL
Image Editing     ImgEdit-Data    Overall               3.2    BAGEL
Image Editing     ImgEdit-Data    Remove                2.62   BAGEL
Image Editing     ImgEdit-Data    Replace               3.3    BAGEL
Image Editing     ImgEdit-Data    Style                 4.49   BAGEL
Image Editing     GEdit-Bench-EN  Overall               6.52   BAGEL
Image Editing     GEdit-Bench-EN  Perceptual Quality    6.83   BAGEL
Image Editing     GEdit-Bench-EN  Semantic Consistency  7.36   BAGEL
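
The WISE rows list each category twice, once for "Bagel (w/ cot)" and once for "Bagel", so the per-category difference between the two entries can be read off with simple subtraction. The following is a minimal sketch of that arithmetic, using only the values listed in the results; the names `wise_scores` and `cot_gains` are illustrative and do not come from the paper or its code release.

```python
# WISE scores copied from the results listing: (Bagel w/ cot, Bagel).
# The dictionary and function names are illustrative assumptions.
wise_scores = {
    "Biology": (0.65, 0.44),
    "Chemistry": (0.58, 0.39),
    "Cultural": (0.76, 0.44),
    "Physics": (0.75, 0.60),
    "Space": (0.75, 0.68),
    "Time": (0.69, 0.55),
    "Overall": (0.70, 0.52),
}


def cot_gains(scores):
    """Return {category: score with CoT minus score without}, rounded to 2 decimals."""
    return {
        name: round(with_cot - without_cot, 2)
        for name, (with_cot, without_cot) in scores.items()
    }


gains = cot_gains(wise_scores)

# Print categories from largest to smallest difference.
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<10} {gain:+.2f}")
```

On these numbers the largest difference is in the Cultural category and the smallest in Space. The subtraction only compares the two listed entries and says nothing about statistical significance.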

Related Papers

EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent (2025-07-21)
NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining (2025-07-18)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)