Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer

Published: 2021-12-20 · CVPR 2022
Tasks: Denoising, Super-Resolution, Text-to-Image Generation, Vocal Bursts Intensity Prediction, Layout-to-Image Generation, Image Reconstruction, Image Inpainting, Unconditional Image Generation, Image Generation
Links: Paper · PDF · Code

Abstract

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion.
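The abstract's core recipe — encode an image into a compact latent space with a frozen pretrained autoencoder, then train a standard noise-prediction diffusion objective on those latents — can be illustrated with a toy NumPy sketch. This is not the authors' code: the linear `E`/`D` stand in for the paper's KL/VQ-regularized autoencoder, the matrix `W` stands in for the conditional UNet, and all names (`q_sample`, `training_loss`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frozen "autoencoder" (assumption): 16-dim "images" -> 4-dim latents.
E = rng.standard_normal((4, 16)) * 0.25   # encoder
D = np.linalg.pinv(E)                     # decoder (exact only on the latent subspace)

# Linear DDPM-style noise schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cum = np.cumprod(1.0 - betas)

def q_sample(z0, t, eps):
    """Forward diffusion in latent space: z_t = sqrt(a_t) z0 + sqrt(1 - a_t) eps."""
    a = alphas_cum[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def training_loss(x, W):
    """One LDM-style training step: encode once, noise a random timestep,
    and regress the injected noise (the usual eps-prediction MSE objective)."""
    z0 = E @ x                            # image -> latent (autoencoder is frozen)
    t = int(rng.integers(T))
    eps = rng.standard_normal(z0.shape)
    zt = q_sample(z0, t, eps)
    eps_pred = W @ zt                     # toy noise predictor in place of the UNet
    return float(np.mean((eps_pred - eps) ** 2))
```

The point of the sketch is the ordering: all diffusion arithmetic happens on the 4-dim `z`, so both training and the many sequential sampling steps pay latent-space cost, and `D` is applied only once at the end to decode a generated latent back to pixel space.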

Results

Task | Dataset | Metric | Value | Model
Image Generation | CelebA-HQ 256x256 | FID | 5.11 | LDM-4
Image Generation | ImageNet 512x512 | FID | 3.6 | Latent Diffusion (LDM-4-G)
Image Generation | ImageNet 512x512 | Inception Score | 247.67 | Latent Diffusion (LDM-4-G)
Image Generation | COCO (Common Objects in Context) | FID | 12.63 | Latent Diffusion (LDM-KL-8-G)
Image Generation | Conceptual Captions | FID | 17.01 | LDM-4
Image Generation | DrawBench | Aesthetics (LAION Aesthetics Predictor) | 5.4292 | Stable Diffusion 1.5
Image Generation | DrawBench | Human Preference Alignment (HPSv2) | 0.2646 | Stable Diffusion 1.5
Image Generation | DrawBench | Text Alignment (SentenceBERT) | 0.5997 | Stable Diffusion 1.5
Image Generation | LayoutBench | AP | 9.9 | LDM
Image Generation | COCO-Stuff 256x256 | FID | 40.96 | LDM-4 (200 steps)
Image Generation | COCO-Stuff 256x256 | FID | 42.06 | LDM-8 (100 steps)
Image Reconstruction | Ultra-High Resolution Image Reconstruction Benchmark | PSNR | 26.86 | SD-VAE (16x16)
Image Reconstruction | Ultra-High Resolution Image Reconstruction Benchmark | rFID | 1.07 | SD-VAE (16x16)
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 12.63 | Latent Diffusion (LDM-KL-8-G)
Text-to-Image Generation | Conceptual Captions | FID | 17.01 | LDM-4
Text-to-Image Generation | DrawBench | Aesthetics (LAION Aesthetics Predictor) | 5.4292 | Stable Diffusion 1.5
Text-to-Image Generation | DrawBench | Human Preference Alignment (HPSv2) | 0.2646 | Stable Diffusion 1.5
Text-to-Image Generation | DrawBench | Text Alignment (SentenceBERT) | 0.5997 | Stable Diffusion 1.5
10-shot Image Generation | COCO (Common Objects in Context) | FID | 12.63 | Latent Diffusion (LDM-KL-8-G)
10-shot Image Generation | Conceptual Captions | FID | 17.01 | LDM-4
10-shot Image Generation | DrawBench | Aesthetics (LAION Aesthetics Predictor) | 5.4292 | Stable Diffusion 1.5
10-shot Image Generation | DrawBench | Human Preference Alignment (HPSv2) | 0.2646 | Stable Diffusion 1.5
10-shot Image Generation | DrawBench | Text Alignment (SentenceBERT) | 0.5997 | Stable Diffusion 1.5
1 Image, 2*2 Stitching | COCO (Common Objects in Context) | FID | 12.63 | Latent Diffusion (LDM-KL-8-G)
1 Image, 2*2 Stitching | Conceptual Captions | FID | 17.01 | LDM-4
1 Image, 2*2 Stitching | DrawBench | Aesthetics (LAION Aesthetics Predictor) | 5.4292 | Stable Diffusion 1.5
1 Image, 2*2 Stitching | DrawBench | Human Preference Alignment (HPSv2) | 0.2646 | Stable Diffusion 1.5
1 Image, 2*2 Stitching | DrawBench | Text Alignment (SentenceBERT) | 0.5997 | Stable Diffusion 1.5
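Most of the values above are Fréchet Inception Distance (FID) scores: generated and reference images are embedded with an Inception network, a Gaussian is fitted to each set of embeddings, and the Fréchet distance between the two Gaussians is reported (lower is better). A minimal NumPy sketch of that final distance computation (not the benchmark's evaluation code; the Inception embedding step is assumed to have already produced the means and covariances) looks like this:

```python
import numpy as np

def sqrtm_psd(a):
    """Square root of a symmetric positive semi-definite matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)             # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def fid(mu1, cov1, mu2, cov2):
    """Frechet distance between N(mu1, cov1) and N(mu2, cov2):
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2)).
    Uses Tr((C1 C2)^(1/2)) = Tr((C1^(1/2) C2 C1^(1/2))^(1/2)) so that
    only symmetric PSD square roots are needed."""
    s1 = sqrtm_psd(cov1)
    tr_covmean = np.trace(sqrtm_psd(s1 @ cov2 @ s1))
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_covmean)
```

Identical distributions give an FID of 0, and shifting one mean by a unit vector under identity covariances gives exactly 1, which makes the function easy to sanity-check.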

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models (2025-07-17)
SpectraLift: Physics-Guided Spectral-Inversion Network for Self-Supervised Hyperspectral Image Super-Resolution (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing Constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)