Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


All are Worth Words: A ViT Backbone for Diffusion Models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, Jun Zhu

Published: 2022-09-25 · CVPR 2023
Tasks: Text-to-Image Generation, Image Generation, Conditional Image Generation
Links: Paper · PDF · Code (official)

Abstract

Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs including the time, condition and noisy image patches as tokens and employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256x256, and 5.48 in text-to-image generation on MS-COCO, among methods without accessing large external datasets during the training of generative models. Our results suggest that, for diffusion-based image modeling, the long skip connection is crucial while the down-sampling and up-sampling operators in CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large scale cross-modality datasets.
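The two design choices the abstract highlights — treating the time step, condition, and noisy-image patches all as tokens, and wiring long skip connections between shallow and deep layers — can be illustrated with a minimal NumPy sketch. This is a toy illustration, not the authors' implementation: the transformer blocks are replaced by simple residual MLP layers (attention omitted for brevity), and all names, dimensions, and initializations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W, C) image into (num_patches, p*p*C) patch tokens."""
    H, W, C = img.shape
    gh, gw = H // p, W // p
    x = img[:gh * p, :gw * p].reshape(gh, p, gw, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, p * p * C)

def unpatchify(tokens, H, W, C, p):
    """Inverse of patchify: reassemble patch tokens into an (H, W, C) image."""
    gh, gw = H // p, W // p
    x = tokens.reshape(gh, gw, p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

class UViTSketch:
    """Toy U-ViT: every input becomes a token; long skips pair shallow and deep blocks."""
    def __init__(self, dim, depth, token_dim):
        assert depth % 2 == 0, "depth must split into a shallow half and a deep half"
        self.depth = depth
        self.embed = rng.standard_normal((token_dim, dim)) * 0.02
        self.time_embed = rng.standard_normal((1, dim)) * 0.02
        # one weight matrix per "transformer block" (attention omitted in this sketch)
        self.blocks = [rng.standard_normal((dim, dim)) * 0.02 for _ in range(depth)]
        # projections that fuse a long skip (concat of 2*dim -> dim) in the deep half
        self.skip_proj = [rng.standard_normal((2 * dim, dim)) * 0.02
                          for _ in range(depth // 2)]
        self.out = rng.standard_normal((dim, token_dim)) * 0.02

    def __call__(self, img_tokens, t):
        x = img_tokens @ self.embed                          # noisy-image patches as tokens
        time_tok = np.full((1, self.embed.shape[1]), t) + self.time_embed  # time as a token
        x = np.concatenate([time_tok, x], axis=0)            # all inputs are tokens
        skips, half = [], self.depth // 2
        for i in range(half):                                # shallow half: record skips
            x = np.tanh(x @ self.blocks[i]) + x
            skips.append(x)
        for i in range(half, self.depth):                    # deep half: fuse long skips
            x = np.concatenate([x, skips.pop()], axis=1) @ self.skip_proj[i - half]
            x = np.tanh(x @ self.blocks[i]) + x
        return x[1:] @ self.out                              # drop time token, predict per-patch noise
```

The `skips.pop()` pairing mirrors a U-Net: the deepest block fuses the most recent shallow activation, so shallow and deep layers are connected end to end without any down- or up-sampling, which is exactly the trade-off the paper's results speak to.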

Results

Task                     | Dataset                          | Metric | Value | Model
Image Generation         | COCO (Common Objects in Context) | FID    | 5.48  | U-ViT-S/2-Deep
Image Generation         | COCO (Common Objects in Context) | FID    | 5.95  | U-ViT-S/2
Text-to-Image Generation | COCO (Common Objects in Context) | FID    | 5.48  | U-ViT-S/2-Deep
Text-to-Image Generation | COCO (Common Objects in Context) | FID    | 5.95  | U-ViT-S/2

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing Constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)
CharaConsist: Fine-Grained Consistent Character Generation (2025-07-15)
Modeling Code: Is Text All You Need? (2025-07-15)