DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis

Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, Changsheng Xu

2020-08-13 | CVPR 2022 | Text-to-Image Generation, Text Matching, Image Generation

Abstract

Synthesizing high-quality realistic images from text descriptions is a challenging task. Existing text-to-image Generative Adversarial Networks generally employ a stacked architecture as the backbone, yet three flaws remain. First, the stacked architecture introduces entanglements between generators of different image scales. Second, existing studies prefer to apply and fix extra networks in adversarial learning for text-image semantic consistency, which limits the supervision capability of these networks. Third, the cross-modal attention-based text-image fusion that is widely adopted by previous works is limited to a few specific image scales because of its computational cost. To address these issues, we propose a simpler but more effective Deep Fusion Generative Adversarial Networks (DF-GAN). Specifically, we propose: (i) a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators, (ii) a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output, which enhances text-image semantic consistency without introducing extra networks, and (iii) a novel deep text-image fusion block, which deepens the fusion process to fully fuse text and visual features. Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic and text-matching images, and it achieves better performance on widely used datasets.
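
As a rough illustration of point (iii), the sketch below stacks several text-conditioned fusion steps on a tiny feature map. It assumes each fusion step is a channel-wise affine transformation (a scale and a shift predicted from the sentence vector) followed by ReLU; the abstract does not spell out the operation, so treat this as one plausible reading rather than the authors' implementation. All function names, sizes, and random weights are hypothetical, and only the Python standard library is used.

```python
import random

def linear(vec, weights, bias):
    """Dense layer: `weights` holds one row of input weights per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def make_linear(n_in, n_out, rng):
    """Hypothetical random initialisation, for illustration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def affine_fuse(features, text_vec, gamma_layer, beta_layer):
    """Scale and shift every channel with values predicted from the text.

    `features` is a list of channels, each a flat list of spatial activations.
    """
    gamma = linear(text_vec, *gamma_layer)  # one scale per channel
    beta = linear(text_vec, *beta_layer)    # one shift per channel
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]

def relu(features):
    return [[max(0.0, v) for v in channel] for channel in features]

def deep_fusion(features, text_vec, layers):
    """Apply the text-conditioned affine step plus ReLU once per layer."""
    for gamma_layer, beta_layer in layers:
        features = relu(affine_fuse(features, text_vec, gamma_layer, beta_layer))
    return features

# Toy sizes (assumptions): 3 channels, 4 spatial positions, 5-dim text vector.
rng = random.Random(0)
n_channels, n_pixels, text_dim, depth = 3, 4, 5, 2
features = [[rng.uniform(-1, 1) for _ in range(n_pixels)]
            for _ in range(n_channels)]
text_vec = [rng.uniform(-1, 1) for _ in range(text_dim)]
layers = [(make_linear(text_dim, n_channels, rng),
           make_linear(text_dim, n_channels, rng))
          for _ in range(depth)]

fused = deep_fusion(features, text_vec, layers)
```

Because the text vector is re-injected at every layer, increasing `depth` lets the description influence the visual features repeatedly instead of once, which is the intuition behind "deepening" the fusion process.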

Results

Task	Dataset	Metric	Value	Model
Image Generation	CUB	Inception score	4.86	DF-GAN
Image Generation	Multi-Modal-CelebA-HQ	Acc	17.3	DFGAN
Image Generation	Multi-Modal-CelebA-HQ	FID	137.6	DFGAN
Image Generation	Multi-Modal-CelebA-HQ	LPIPS	0.581	DFGAN
Image Generation	Multi-Modal-CelebA-HQ	Real	14.5	DFGAN
Text-to-Image Generation	CUB	Inception score	4.86	DF-GAN
Text-to-Image Generation	Multi-Modal-CelebA-HQ	Acc	17.3	DFGAN
Text-to-Image Generation	Multi-Modal-CelebA-HQ	FID	137.6	DFGAN
Text-to-Image Generation	Multi-Modal-CelebA-HQ	LPIPS	0.581	DFGAN
Text-to-Image Generation	Multi-Modal-CelebA-HQ	Real	14.5	DFGAN
10-shot image generation	Multi-Modal-CelebA-HQ	Acc	17.3	DFGAN
10-shot image generation	Multi-Modal-CelebA-HQ	FID	137.6	DFGAN
10-shot image generation	Multi-Modal-CelebA-HQ	LPIPS	0.581	DFGAN
10-shot image generation	Multi-Modal-CelebA-HQ	Real	14.5	DFGAN
10-shot image generation	CUB	Inception score	4.86	DF-GAN
1 Image, 2*2 Stitchi	Multi-Modal-CelebA-HQ	Acc	17.3	DFGAN
1 Image, 2*2 Stitchi	Multi-Modal-CelebA-HQ	FID	137.6	DFGAN
1 Image, 2*2 Stitchi	Multi-Modal-CelebA-HQ	LPIPS	0.581	DFGAN
1 Image, 2*2 Stitchi	Multi-Modal-CelebA-HQ	Real	14.5	DFGAN
1 Image, 2*2 Stitchi	CUB	Inception score	4.86	DF-GAN
