Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion

Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, Tim Salimans

2024-10-25

Video Prediction, Image Generation

Abstract

Latent diffusion models have become the popular choice for scaling up diffusion models for high-resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to produce higher image quality at high resolution. Here we challenge these notions and show that pixel-space models can be very competitive with latent models in both quality and efficiency, achieving 1.5 FID on ImageNet512 and new SOTA results on ImageNet128, ImageNet256 and Kinetics600. We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions: (1) use the sigmoid loss weighting (Kingma & Gao, 2023) with our prescribed hyperparameters; (2) use our simplified memory-efficient architecture with fewer skip connections; (3) scale the model to favor processing the image at a high resolution with fewer parameters, rather than using more parameters at a lower resolution. Combining these with guidance intervals, we obtain a family of pixel-space diffusion models we call Simpler Diffusion (SiD2).
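The first step of the recipe, sigmoid loss weighting, can be illustrated with a small sketch. It assumes the form w(λ) = sigmoid(b − λ), where λ is the log signal-to-noise ratio of a training sample and b is a bias hyperparameter, which is our reading of Kingma & Gao (2023). The names `sigmoid_weight`, `weighted_loss` and `bias` are hypothetical. The paper's prescribed bias values are not given on this page, so the value used below is only a placeholder, and the noise-schedule factor of the full objective is omitted for brevity.

```python
import math


def sigmoid_weight(log_snr: float, bias: float) -> float:
    """Sigmoid loss weight w(lambda) = sigmoid(bias - lambda).

    Assumed form; `bias` is a free hyperparameter here, not the paper's value.
    """
    x = bias - log_snr
    # Numerically stable logistic function for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def weighted_loss(eps_mse: float, log_snr: float, bias: float) -> float:
    """Scale a per-sample epsilon-prediction squared error by the sigmoid weight."""
    return sigmoid_weight(log_snr, bias) * eps_mse


if __name__ == "__main__":
    placeholder_bias = 0.0  # illustrative only, not a prescribed setting
    for log_snr in (-6.0, 0.0, 6.0):
        w = sigmoid_weight(log_snr, placeholder_bias)
        print(f"log-SNR {log_snr:+.1f}: weight {w:.4f}")
```

Under this assumed form, the weight falls as the log-SNR rises, so nearly clean samples contribute less to the loss than noisy ones, and the bias shifts where that transition happens.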

Results

Task	Dataset	Metric	Value	Model
Image Generation	ImageNet 128x128	FID	1.26	SiD2
Image Generation	ImageNet 512x512	FID	1.48	SiD2
Image Generation	ImageNet 256x256	FID	1.38	SiD2
Video Prediction	Kinetics-600 12 frames, 64x64	Cond	5	SiD2
Video Prediction	Kinetics-600 12 frames, 64x64	FVD	2.3	SiD2
Video Prediction	Kinetics-600 12 frames, 64x64	Pred	11	SiD2
