Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, WeiHao Wang, Kevin Qinghong Lin, YuChao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou

2024-08-22
Tasks: Question Answering, Text-to-Image Generation, Text to Image Generation, Image Generation, 10-shot image generation, Visual Question Answering

Abstract

We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model. Code and models are released at https://github.com/showlab/Show-o.
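The abstract does not say how one transformer can treat some tokens autoregressively and others with diffusion-style modeling. A minimal sketch of one plausible ingredient is a mixed attention mask: text tokens attend causally, while tokens inside an image block also attend to each other in both directions. The function name `mixed_attention_mask`, the `"text"`/`"image"` labels, and the masking rule itself are illustrative assumptions, not details taken from this page.

```python
from typing import List


def mixed_attention_mask(modalities: List[str]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed rule (hypothetical, for illustration only):
      - every token attends to itself and all earlier positions (causal);
      - an image token additionally attends to later tokens that belong
        to the same contiguous image block (bidirectional within a block).
    """
    n = len(modalities)

    # Give each contiguous run of image tokens its own block id; text gets -1.
    block = [-1] * n
    current = -1
    for i, kind in enumerate(modalities):
        if kind == "image":
            if i == 0 or modalities[i - 1] != "image":
                current += 1
            block[i] = current

    # Start from a plain causal (lower-triangular) mask.
    mask = [[j <= i for j in range(n)] for i in range(n)]

    # Open up the future positions inside each image block.
    for i in range(n):
        if block[i] >= 0:
            for j in range(i + 1, n):
                if block[j] == block[i]:
                    mask[i][j] = True
    return mask


if __name__ == "__main__":
    sequence = ["text", "text", "image", "image", "text"]
    for row in mixed_attention_mask(sequence):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumed rule, a text prompt followed by image tokens keeps next-token prediction for the text while letting every image token see the whole image block, which is what a masked-token (discrete diffusion) objective needs.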

Results

Task | Dataset | Metric | Value | Model
Image Generation | WISE | Biology | 0.3 | Show-o
Image Generation | WISE | Chemistry | 0.3 | Show-o
Image Generation | WISE | Cultural | 0.28 | Show-o
Image Generation | WISE | Overall | 0.35 | Show-o
Image Generation | WISE | Physics | 0.46 | Show-o
Image Generation | WISE | Space | 0.48 | Show-o
Image Generation | WISE | Time | 0.4 | Show-o
Image Generation | GenEval | Overall | 0.68 | Und. and Gen. Show-o (Ours)
Text-to-Image Generation | GenEval | Overall | 0.68 | Und. and Gen. Show-o (Ours)
10-shot image generation | GenEval | Overall | 0.68 | Und. and Gen. Show-o (Ours)
1 Image, 2*2 Stitchi | GenEval | Overall | 0.68 | Und. and Gen. Show-o (Ours)
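
The WISE Overall value (0.35) in the rows above is not the plain average of the six category scores. The sketch below compares an unweighted mean with a prompt-count-weighted mean. The per-category prompt counts in `ASSUMED_PROMPT_COUNTS` are an assumption about how the benchmark is split; they do not appear on this page, so treat the result as a consistency check, not as the benchmark's official formula.

```python
from typing import Dict

# Category scores for Show-o, copied from the WISE rows of the results table.
WISE_SCORES: Dict[str, float] = {
    "Cultural": 0.28,
    "Time": 0.4,
    "Space": 0.48,
    "Biology": 0.3,
    "Physics": 0.46,
    "Chemistry": 0.3,
}

# ASSUMPTION: number of prompts per category (1000 in total).
# These counts are not stated on this page.
ASSUMED_PROMPT_COUNTS: Dict[str, int] = {
    "Cultural": 400,
    "Time": 167,
    "Space": 133,
    "Biology": 100,
    "Physics": 100,
    "Chemistry": 100,
}


def unweighted_mean(scores: Dict[str, float]) -> float:
    """Plain average over categories."""
    return sum(scores.values()) / len(scores)


def weighted_mean(scores: Dict[str, float], weights: Dict[str, int]) -> float:
    """Average over categories, weighting each by its prompt count."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


if __name__ == "__main__":
    print(f"unweighted: {unweighted_mean(WISE_SCORES):.2f}")
    print(f"weighted:   {weighted_mean(WISE_SCORES, ASSUMED_PROMPT_COUNTS):.2f}")
```

With the assumed counts, the weighted mean rounds to the reported 0.35, while the unweighted mean rounds to 0.37.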

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)