Patrick Esser, Robin Rombach, Björn Ommer
Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers .
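The two-stage recipe described in the abstract — a convolutional VQGAN that learns a discrete codebook of image constituents, followed by a transformer that models the composition of codebook entries — can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation (see the linked repository for that); the codebook size, embedding dimension, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Stage 1 core (sketch): snap continuous CNN encoder features onto the
    nearest entries of a learned codebook. Sizes here are illustrative."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                               # z: (B, dim, H, W)
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)     # (B*H*W, dim)
        dists = torch.cdist(flat, self.codebook.weight) # distance to every code
        idx = dists.argmin(dim=1)                       # nearest code per position
        z_q = self.codebook(idx).view(B, H, W, C).permute(0, 3, 1, 2)
        z_q = z + (z_q - z).detach()                    # straight-through gradients
        return z_q, idx.view(B, H * W)                  # features + token sequence

# Stage 2 (sketch): treat the H*W code indices as a sequence and train a
# GPT-style decoder to model p(s_i | s_<i); sampling new index sequences and
# pushing them through the VQGAN decoder yields novel high-resolution images.
```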
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image-to-Image Translation | COCO-Stuff Labels-to-Photos | FID | 22.4 | VQGAN+Transformer |
| Image-to-Image Translation | ADE20K Labels-to-Photos | FID | 35.5 | VQGAN+Transformer |
| Image Generation | FFHQ 256x256 | FID | 9.6 | VQGAN+Transformer |
| Image Generation | CelebA-HQ 256x256 | FID | 10.2 | VQGAN+Transformer |
| Image Generation | ImageNet 256x256 | FID | 5.2 | VQGAN+Transformer (k=600, p=1.0, a=0.05) |
| Image Generation | ImageNet 256x256 | FID | 6.59 | VQGAN+Transformer (k=mixed, p=1.0, a=0.005) |
| Image Generation | LHQC | Block-FID | 38.89 | Taming |
| Image Reconstruction | Ultra-High Resolution Image Reconstruction Benchmark | PSNR | 22.91 | VQGAN (16x16) |
| Image Reconstruction | Ultra-High Resolution Image Reconstruction Benchmark | rFID | 5.95 | VQGAN (16x16) |
| Image Reconstruction | ImageNet | FID | 3.64 | Taming-VQGAN (16x16) |
| Image Reconstruction | ImageNet | LPIPS | 0.177 | Taming-VQGAN (16x16) |
| Image Reconstruction | ImageNet | PSNR | 19.93 | Taming-VQGAN (16x16) |
| Image Reconstruction | ImageNet | SSIM | 0.542 | Taming-VQGAN (16x16) |
| DeepFake Detection | FakeAVCeleb | AP | 55 | VQGAN |
| DeepFake Detection | FakeAVCeleb | ROC AUC | 51.8 | VQGAN |
| Text-to-Image Generation | Conceptual Captions | FID | 28.86 | VQGAN |
| Image Outpainting | LHQC | Block-FID (Right Extend) | 22.53 | Taming |
| Image Outpainting | LHQC | Block-FID (Down Extend) | 26.38 | Taming |
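Most rows above report FID (Fréchet Inception Distance), which fits a Gaussian to Inception features of real and generated images and measures the distance between the two fits; rFID is the same quantity computed between inputs and their reconstructions. As a reference, a minimal NumPy/SciPy sketch of the distance itself, assuming feature extraction with an Inception network has already happened upstream:

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, fake_feats):
    """FID between feature sets of shape (N, D) and (M, D):
    ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 * sqrt(C_r @ C_f))."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f).real  # drop numerical imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```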