Patrick Esser, Robin Rombach, Björn Ommer
Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers .
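The two-stage recipe described in the abstract — a convolutional VQGAN that learns a discrete codebook of image constituents, followed by a transformer that models the composition of codebook entries — can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation (see the linked repository for that); the codebook size, embedding dimension, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Stage 1 core (sketch): snap continuous CNN encoder features onto the
    nearest entries of a learned codebook. Sizes here are illustrative."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                               # z: (B, dim, H, W)
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)     # (B*H*W, dim)
        dists = torch.cdist(flat, self.codebook.weight) # distance to every code
        idx = dists.argmin(dim=1)                       # nearest code per position
        z_q = self.codebook(idx).view(B, H, W, C).permute(0, 3, 1, 2)
        z_q = z + (z_q - z).detach()                    # straight-through gradients
        return z_q, idx.view(B, H * W)                  # features + token sequence

# Stage 2 (sketch): treat the H*W code indices as a sequence and train a
# GPT-style decoder to model p(s_i | s_<i); sampling new index sequences and
# pushing them through the VQGAN decoder yields novel high-resolution images.
```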
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image-to-Image Translation | COCO-Stuff Labels-to-Photos | FID | 22.4 | VQGAN+Transformer |
| Image-to-Image Translation | ADE20K Labels-to-Photos | FID | 35.5 | VQGAN+Transformer |
| Image Generation | FFHQ 256x256 | FID | 9.6 | VQGAN+Transformer |
| Image Generation | CelebA-HQ 256x256 | FID | 10.2 | VQGAN+Transformer |
| Image Generation | ImageNet 256x256 | FID | 5.2 | VQGAN+Transformer (k=600, p=1.0, a=0.05) |
| Image Generation | ImageNet 256x256 | FID | 6.59 | VQGAN+Transformer (k=mixed, p=1.0, a=0.005) |
| Image Generation | LHQC | Block-FID | 38.89 | Taming |
| Image Reconstruction | Ultra-High Resolution Image Reconstruction Benchmark | PSNR | 22.91 | VQGAN (16x16) |
| Image Reconstruction | Ultra-High Resolution Image Reconstruction Benchmark | rFID | 5.95 | VQGAN (16x16) |
| Image Reconstruction | ImageNet | FID | 3.64 | Taming-VQGAN (16x16) |
| Image Reconstruction | ImageNet | LPIPS | 0.177 | Taming-VQGAN (16x16) |
| Image Reconstruction | ImageNet | PSNR | 19.93 | Taming-VQGAN (16x16) |
| Image Reconstruction | ImageNet | SSIM | 0.542 | Taming-VQGAN (16x16) |
| DeepFake Detection | FakeAVCeleb | AP | 55 | VQGAN |
| DeepFake Detection | FakeAVCeleb | ROC AUC | 51.8 | VQGAN |
| Text-to-Image Generation | Conceptual Captions | FID | 28.86 | VQGAN |
| Image Outpainting | LHQC | Block-FID (Right Extend) | 22.53 | Taming |
| Image Outpainting | LHQC | Block-FID (Down Extend) | 26.38 | Taming |
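Most rows above report FID (Fréchet Inception Distance), which fits a Gaussian to Inception features of real and generated images and measures the distance between the two fits; rFID is the same quantity computed between inputs and their reconstructions. As a reference, a minimal NumPy/SciPy sketch of the distance itself, assuming feature extraction with an Inception network has already happened upstream:

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, fake_feats):
    """FID between feature sets of shape (N, D) and (M, D):
    ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 * sqrt(C_r @ C_f))."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f).real  # drop numerical imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```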