Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Taming Transformers for High-Resolution Image Synthesis

Patrick Esser, Robin Rombach, Björn Ommer

2020-12-17 · CVPR 2021
Tasks: Text-to-Image Generation, Vocal Bursts Intensity Prediction, Image Reconstruction, DeepFake Detection, Image Outpainting, Image Generation, Image-to-Image Translation
Links: Paper · PDF · Code (official)

Abstract

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers .
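The first stage described above (a CNN encoder whose features are snapped to a learned, context-rich codebook of image constituents) can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's trained VQGAN: the codebook is random, and the sizes (1024 entries, 256-dim codes, a 16x16 latent grid) are assumptions chosen to mirror typical configurations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed codebook: 1024 learned entries of dimension 256 (random stand-ins here).
codebook = rng.normal(size=(1024, 256))

def quantize(features):
    """Snap each spatial feature vector to its nearest codebook entry.

    features: (H, W, D) array of encoder outputs.
    Returns (indices, quantized): indices is the (H, W) grid of discrete
    codes; quantized holds the codebook entry chosen at each position.
    """
    h, w, d = features.shape
    flat = features.reshape(-1, d)
    # Squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2 (memory-friendly).
    d2 = ((flat ** 2).sum(1)[:, None]
          - 2.0 * flat @ codebook.T
          + (codebook ** 2).sum(1)[None, :])
    idx = d2.argmin(axis=1)                  # nearest entry per position
    return idx.reshape(h, w), codebook[idx].reshape(h, w, d)

# A 16x16 feature grid, as a 256x256 image yields under 16x downsampling.
feats = rng.normal(size=(16, 16, 256))
indices, quantized = quantize(feats)
print(indices.shape)  # (16, 16)
```

The (16, 16) index grid is the sequence the paper's second stage models: a transformer predicts these discrete codes autoregressively, and the decoder maps the selected codebook entries back to pixels.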

Results

Task | Dataset | Metric | Value | Model
Image-to-Image Translation | COCO-Stuff Labels-to-Photos | FID | 22.4 | VQGAN+Transformer
Image-to-Image Translation | ADE20K Labels-to-Photos | FID | 35.5 | VQGAN+Transformer
Image Generation | FFHQ 256x256 | FID | 9.6 | VQGAN+Transformer
Image Generation | CelebA 256x256 | FID | 10.2 | VQGAN
Image Generation | CelebA-HQ 256x256 | FID | 10.2 | VQGAN+Transformer
Image Generation | ImageNet 256x256 | FID | 5.2 | VQGAN+Transformer (k=600, p=1.0, a=0.05)
Image Generation | ImageNet 256x256 | FID | 6.59 | VQGAN+Transformer (k=mixed, p=1.0, a=0.005)
Image Generation | COCO-Stuff Labels-to-Photos | FID | 22.4 | VQGAN+Transformer
Image Generation | ADE20K Labels-to-Photos | FID | 35.5 | VQGAN+Transformer
Image Generation | Conceptual Captions | FID | 28.86 | VQ-GAN
Image Generation | LHQC | Block-FID | 38.89 | Taming
3D Reconstruction | FakeAVCeleb | AP | 55 | VQGAN
3D Reconstruction | FakeAVCeleb | ROC AUC | 51.8 | VQGAN
Image Reconstruction | Ultra-High Resolution Image Reconstruction Benchmark | PSNR | 22.91 | VQGAN (16x16)
Image Reconstruction | Ultra-High Resolution Image Reconstruction Benchmark | rFID | 5.95 | VQGAN (16x16)
Image Reconstruction | ImageNet | FID | 3.64 | Taming-VQGAN (16x16)
Image Reconstruction | ImageNet | LPIPS | 0.177 | Taming-VQGAN (16x16)
Image Reconstruction | ImageNet | PSNR | 19.93 | Taming-VQGAN (16x16)
Image Reconstruction | ImageNet | SSIM | 0.542 | Taming-VQGAN (16x16)
DeepFake Detection | FakeAVCeleb | AP | 55 | VQGAN
DeepFake Detection | FakeAVCeleb | ROC AUC | 51.8 | VQGAN
Text-to-Image Generation | Conceptual Captions | FID | 28.86 | VQ-GAN
Text-to-Image Generation | LHQC | Block-FID | 38.89 | Taming
Image Outpainting | LHQC | Block-FID (Right Extend) | 22.53 | Taming
Image Outpainting | LHQC | Block-FID (Down Extend) | 26.38 | Taming
10-shot image generation | Conceptual Captions | FID | 28.86 | VQ-GAN
10-shot image generation | LHQC | Block-FID | 38.89 | Taming
1 Image, 2*2 Stitching | Conceptual Captions | FID | 28.86 | VQ-GAN
1 Image, 2*2 Stitching | LHQC | Block-FID | 38.89 | Taming
3D Shape Reconstruction from Videos | FakeAVCeleb | AP | 55 | VQGAN
3D Shape Reconstruction from Videos | FakeAVCeleb | ROC AUC | 51.8 | VQGAN
1 Image, 2*2 Stitching | COCO-Stuff Labels-to-Photos | FID | 22.4 | VQGAN+Transformer
1 Image, 2*2 Stitching | ADE20K Labels-to-Photos | FID | 35.5 | VQGAN+Transformer
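Most rows above report FID, the Fréchet distance between two Gaussians fit to Inception-v3 features of real and generated images (lower is better). As a hedged sketch of the arithmetic only: the real metric uses full covariance matrices and a matrix square root; here we assume diagonal covariances, which reduces the square root to an elementwise one, and use random vectors as stand-in features.

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Frechet distance between N(mu1, diag(var1)) and N(mu2, diag(var2)).

    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))
    (the diagonal-covariance special case of the FID formula).
    """
    mean_term = ((mu1 - mu2) ** 2).sum()
    cov_term = (var1 + var2 - 2.0 * np.sqrt(var1 * var2)).sum()
    return mean_term + cov_term

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, size=(5000, 64))   # stand-in "Inception" features
fake = rng.normal(loc=0.1, size=(5000, 64))   # slightly shifted generator

mu_r, var_r = real.mean(0), real.var(0)
mu_f, var_f = fake.mean(0), fake.var(0)

print(fid_diagonal(mu_r, var_r, mu_f, var_f))  # positive: distributions differ
print(fid_diagonal(mu_r, var_r, mu_r, var_r))  # ~0 for identical inputs
```

Published FID numbers like those in the table additionally depend on the feature extractor, sample count, and image preprocessing, so values are only comparable within a consistent evaluation protocol.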

Related Papers

SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks (2025-07-17)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)
CharaConsist: Fine-Grained Consistent Character Generation (2025-07-15)