LAFITE: Towards Language-Free Training for Text-to-Image Generation

Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, Tong Sun

2021-11-27
Text-to-Image Generation, Image Generation

Abstract

One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is time-consuming and costly. In this paper, we propose the first method for training text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the pre-trained CLIP model: the requirement for text conditioning is removed by generating text features from image features. Extensive experiments illustrate the effectiveness of the proposed method. We obtain state-of-the-art results on standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied to fine-tuning pre-trained models, which saves both training time and cost. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, with only around 1% of the model size and training data size of the recently proposed large DALL-E model.
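
The idea of generating text features from image features can be illustrated with a small sketch. This summary does not spell out the procedure, so the perturbation scheme below (Gaussian noise rescaled to a fixed fraction of the feature norm, followed by L2 normalization) is an assumption made for illustration, as are the names `pseudo_text_feature` and `noise_level`. The input vector is a hypothetical stand-in for the output of a CLIP image encoder, which is not called here.

```python
import math
import random


def l2_norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


def pseudo_text_feature(image_feature, noise_level=0.1, rng=None):
    """Perturb an image feature so it can stand in for a text feature.

    Assumed scheme (illustrative, not taken from the summary above):
    add Gaussian noise whose length is noise_level times the length of
    the image feature, then project back onto the unit sphere.
    """
    rng = rng or random.Random()
    noise = [rng.gauss(0.0, 1.0) for _ in image_feature]
    noise_norm = l2_norm(noise)
    if noise_norm == 0.0:
        # Degenerate draw: fall back to the unperturbed feature.
        return l2_normalize(image_feature)
    scale = noise_level * l2_norm(image_feature) / noise_norm
    perturbed = [v + scale * n for v, n in zip(image_feature, noise)]
    return l2_normalize(perturbed)


# Hypothetical 512-dimensional image feature; a real pipeline would
# obtain this from a pre-trained image encoder instead.
seed_rng = random.Random(0)
image_feature = [seed_rng.gauss(0.0, 1.0) for _ in range(512)]

text_like = pseudo_text_feature(image_feature, noise_level=0.1, rng=seed_rng)
similarity = cosine_similarity(image_feature, text_like)
```

Because the added noise has a bounded length relative to the feature, the pseudo text feature stays close to the image feature in cosine similarity, which is the property a generator conditioned on it would rely on. With `noise_level=0.1` the cosine similarity cannot fall below roughly 0.99.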

Results

Task	Dataset	Metric	Value	Model
Image Generation	COCO (Common Objects in Context)	FID	8.12	Lafite
Image Generation	COCO (Common Objects in Context)	Inception score	32.34	Lafite
Image Generation	COCO (Common Objects in Context)	SOA-C	61.09	Lafite
Image Generation	COCO (Common Objects in Context)	FID	26.94	Lafite (zero-shot)
Image Generation	COCO (Common Objects in Context)	FID-1	22.97	Lafite (zero-shot)
Image Generation	COCO (Common Objects in Context)	FID-2	18.7	Lafite (zero-shot)
Image Generation	COCO (Common Objects in Context)	FID-4	15.72	Lafite (zero-shot)
Image Generation	COCO (Common Objects in Context)	FID-8	14.79	Lafite (zero-shot)
Image Generation	COCO (Common Objects in Context)	Inception score	26.02	Lafite (zero-shot)
Image Generation	CUB	FID	10.48	Lafite
Image Generation	CUB	Inception score	5.97	Lafite
Image Generation	Multi-Modal-CelebA-HQ	FID	12.54	Lafite
Text-to-Image Generation	COCO (Common Objects in Context)	FID	8.12	Lafite
Text-to-Image Generation	COCO (Common Objects in Context)	Inception score	32.34	Lafite
Text-to-Image Generation	COCO (Common Objects in Context)	SOA-C	61.09	Lafite
Text-to-Image Generation	COCO (Common Objects in Context)	FID	26.94	Lafite (zero-shot)
Text-to-Image Generation	COCO (Common Objects in Context)	FID-1	22.97	Lafite (zero-shot)
Text-to-Image Generation	COCO (Common Objects in Context)	FID-2	18.7	Lafite (zero-shot)
Text-to-Image Generation	COCO (Common Objects in Context)	FID-4	15.72	Lafite (zero-shot)
Text-to-Image Generation	COCO (Common Objects in Context)	FID-8	14.79	Lafite (zero-shot)
Text-to-Image Generation	COCO (Common Objects in Context)	Inception score	26.02	Lafite (zero-shot)
Text-to-Image Generation	CUB	FID	10.48	Lafite
Text-to-Image Generation	CUB	Inception score	5.97	Lafite
Text-to-Image Generation	Multi-Modal-CelebA-HQ	FID	12.54	Lafite
