Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


PixArt-$α$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li

2023-09-30 · Text-to-Image Generation · Image Generation · Language Modelling

Paper · PDF · Code (official)

Abstract

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution at low training cost, as shown in Figures 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into the Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions that assist text-image alignment learning. As a result, PIXART-$\alpha$'s training speed markedly surpasses existing large-scale T2I models: PIXART-$\alpha$ takes only 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and cutting CO2 emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control. We hope PIXART-$\alpha$ will provide new insights to the AIGC community and startups, accelerating the construction of their own high-quality yet low-cost generative models from scratch.
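The cost comparison quoted in the abstract can be checked directly from the reported figures (a minimal sketch; the GPU-day and dollar numbers below are those stated in the abstract, not independently verified):

```python
# Training-cost figures quoted in the abstract (A100 GPU days / USD)
pixart_gpu_days, sd15_gpu_days = 675, 6_250
pixart_cost_usd, sd15_cost_usd = 26_000, 320_000

# Fraction of Stable Diffusion v1.5's training time used by PixArt-alpha
time_fraction = pixart_gpu_days / sd15_gpu_days

# Dollar savings relative to SD v1.5
savings_usd = sd15_cost_usd - pixart_cost_usd

print(f"{time_fraction:.1%}")  # 10.8%, matching the abstract
print(savings_usd)             # 294000, i.e. "nearly $300,000"
```

Both derived values agree with the abstract's claims of "10.8% of Stable Diffusion v1.5's training time" and "nearly \$300,000" saved.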

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | WISE | Biology | 0.49 | PixArt-XL-2-1024-MS |
| Image Generation | WISE | Chemistry | 0.34 | PixArt-XL-2-1024-MS |
| Image Generation | WISE | Cultural | 0.45 | PixArt-XL-2-1024-MS |
| Image Generation | WISE | Overall | 0.47 | PixArt-XL-2-1024-MS |
| Image Generation | WISE | Physics | 0.56 | PixArt-XL-2-1024-MS |
| Image Generation | WISE | Space | 0.48 | PixArt-XL-2-1024-MS |
| Image Generation | WISE | Time | 0.50 | PixArt-XL-2-1024-MS |
| Image Generation | T2I-CompBench | Color | 0.6886 | PixArt-α |
| Image Generation | T2I-CompBench | Complex | 0.4117 | PixArt-α |
| Image Generation | T2I-CompBench | Non-Spatial | 0.3179 | PixArt-α |
| Image Generation | T2I-CompBench | Shape | 0.5582 | PixArt-α |
| Image Generation | T2I-CompBench | Spatial | 0.2082 | PixArt-α |
| Image Generation | T2I-CompBench | Texture | 0.7044 | PixArt-α |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
- Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
- FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
- A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing Constraints (2025-07-17)
- Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)