Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, WeiHao Wang, Kevin Qinghong Lin, YuChao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou

2024-08-22

Tasks: Question Answering, Text-to-Image Generation, Text to Image Generation, Image Generation, 10-shot image generation, Visual Question Answering


Abstract

We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model. Code and models are released at https://github.com/showlab/Show-o.
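The abstract does not spell out how a single transformer can serve both autoregressive and diffusion-style modeling. As a rough illustration only, and not the authors' released implementation, one way to combine the two is an attention mask that is causal for text tokens and bidirectional within each contiguous block of image tokens. The function name `build_mixed_attention_mask` and the `"text"` / `"image"` labels below are hypothetical and chosen purely for this sketch.

```python
from typing import List, Optional


def build_mixed_attention_mask(modalities: List[str]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption for this sketch (not taken from the source text):
    - every token attends causally to positions j <= i;
    - image tokens additionally attend to all tokens in their own
      contiguous image block, including later ones.
    """
    # Assign a block id to each contiguous run of image tokens.
    block_ids: List[Optional[int]] = []
    current = -1
    for idx, kind in enumerate(modalities):
        if kind == "image":
            if idx == 0 or modalities[idx - 1] != "image":
                current += 1
            block_ids.append(current)
        else:
            block_ids.append(None)

    n = len(modalities)
    mask: List[List[bool]] = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i
            same_image_block = (
                block_ids[i] is not None and block_ids[i] == block_ids[j]
            )
            row.append(causal or same_image_block)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two text tokens, a three-token image block, then one more text token.
    layout = ["text", "text", "image", "image", "image", "text"]
    for row in build_mixed_attention_mask(layout):
        print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch, text positions never see later tokens, which suits next-token prediction, while image positions see their whole block, which suits predicting masked image tokens in parallel.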

Results

Task	Dataset	Metric	Value	Model
Image Generation	WISE	Biology	0.3	Show-o
Image Generation	WISE	Chemistry	0.3	Show-o
Image Generation	WISE	Cultural	0.28	Show-o
Image Generation	WISE	Overall	0.35	Show-o
Image Generation	WISE	Physics	0.46	Show-o
Image Generation	WISE	Space	0.48	Show-o
Image Generation	WISE	Time	0.4	Show-o
Image Generation	GenEval	Overall	0.68	Und. and Gen. Show-o (Ours)
Text-to-Image Generation	GenEval	Overall	0.68	Und. and Gen. Show-o (Ours)
10-shot image generation	GenEval	Overall	0.68	Und. and Gen. Show-o (Ours)
1 Image, 2*2 Stitchi	GenEval	Overall	0.68	Und. and Gen. Show-o (Ours)
