Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li

2024-03-07
Tasks: Text-to-Image Generation, Text to Image Generation, Image Captioning, 4k, Image Generation

Abstract

In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model via incorporating higher quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
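The key/value compression idea in point (2) can be illustrated with a minimal sketch. The abstract does not say how keys and values are compressed, so everything below is an assumption made for illustration: average pooling over a 2D token grid stands in for the compression operator, the attention is single-head with no learned projections, and the names `avg_pool_tokens` and `kv_compressed_attention` are hypothetical. It is not the authors' implementation.

```python
import math


def avg_pool_tokens(tokens, height, width, ratio):
    """Average-pool a row-major height*width grid of token vectors.

    ASSUMPTION: average pooling is a stand-in for the (unspecified)
    compression operator; `ratio` is the pooling factor per spatial side.
    """
    dim = len(tokens[0])
    pooled = []
    for r0 in range(0, height, ratio):
        for c0 in range(0, width, ratio):
            window = [
                tokens[r * width + c]
                for r in range(r0, min(r0 + ratio, height))
                for c in range(c0, min(c0 + ratio, width))
            ]
            pooled.append(
                [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
            )
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kv_compressed_attention(q, k, v, height, width, ratio=2):
    """Single-head attention with pooled keys/values and full-length queries.

    Returns the attended outputs and the number of compressed K/V tokens.
    """
    k_small = avg_pool_tokens(k, height, width, ratio)
    v_small = avg_pool_tokens(v, height, width, ratio)
    scale = 1.0 / math.sqrt(len(q[0]))
    out_dim = len(v_small[0])
    outputs = []
    for query in q:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in k_small]
        weights = softmax(scores)
        outputs.append(
            [sum(w * val[d] for w, val in zip(weights, v_small)) for d in range(out_dim)]
        )
    return outputs, len(k_small)


# Toy 4x4 grid of 3-dimensional tokens (values are arbitrary).
height, width, dim = 4, 4, 3
tokens = [[float((i * 7 + d * 3) % 5) for d in range(dim)] for i in range(height * width)]
out, n_kv = kv_compressed_attention(tokens, tokens, tokens, height, width, ratio=2)
print(len(out), n_kv)  # 16 4
```

In this sketch the output keeps one vector per query token, while each query attends over only 4 pooled key/value tokens instead of 16. With a pooling factor of R per side, the attention score matrix shrinks by a factor of R², which is why compressing keys and values matters most at very high resolutions, where the token count is largest.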

Results

Task | Dataset | Metric | Value | Model
Image Generation | TextAtlasEval | StyledTextSynth CLIP Score | 0.2764 | PixArt-Sigma
Image Generation | TextAtlasEval | StyledTextSynth FID | 82.83 | PixArt-Sigma
Image Generation | TextAtlasEval | StyledTextSynth OCR (Accuracy) | 0.42 | PixArt-Sigma
Image Generation | TextAtlasEval | StyledTextSynth OCR (CER) | 0.9 | PixArt-Sigma
Image Generation | TextAtlasEval | StyledTextSynth OCR (F1 Score) | 0.62 | PixArt-Sigma
Image Generation | TextAtlasEval | TextScenesHQ CLIP Score | 0.2347 | PixArt-Sigma
Image Generation | TextAtlasEval | TextScenesHQ FID | 72.62 | PixArt-Sigma
Image Generation | TextAtlasEval | TextScenesHQ OCR (Accuracy) | 0.34 | PixArt-Sigma
Image Generation | TextAtlasEval | TextScenesHQ OCR (CER) | 0.91 | PixArt-Sigma
Image Generation | TextAtlasEval | TextScenesHQ OCR (F1 Score) | 0.53 | PixArt-Sigma
Image Generation | TextAtlasEval | TextVisionBlend CLIP Score | 0.1891 | PixArt-Sigma
Image Generation | TextAtlasEval | TextVisionBlend FID | 81.29 | PixArt-Sigma
Image Generation | TextAtlasEval | TextVisionBlend OCR (Accuracy) | 2.4 | PixArt-Sigma
Image Generation | TextAtlasEval | TextVisionBlend OCR (CER) | 0.83 | PixArt-Sigma
Image Generation | TextAtlasEval | TextVisionBlend OCR (F1 Score) | 1.57 | PixArt-Sigma
Image Generation | GenEval | Overall | 0.53 | PixArt-Σ
Text-to-Image Generation | GenEval | Overall | 0.53 | PixArt-Σ
10-shot image generation | GenEval | Overall | 0.53 | PixArt-Σ
1 Image, 2*2 Stitchi | GenEval | Overall | 0.53 | PixArt-Σ

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)
CharaConsist: Fine-Grained Consistent Character Generation (2025-07-15)