Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, Tong Sun
One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time-consuming and costly. In this paper, we propose the first method to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text conditioning is seamlessly alleviated by generating text features from image features. Extensive experiments illustrate the effectiveness of the proposed method. We obtain state-of-the-art results on the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied to fine-tuning pre-trained models, which reduces both the time and the cost of training text-to-image generation models. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, with only around 1% of the model size and training data size of the recently proposed large DALL-E model.
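To make the idea of generating text features from image features more concrete, the following is a minimal sketch under stated assumptions. The abstract does not specify how the text features are produced; the sketch assumes one simple option, namely perturbing a normalized image feature with Gaussian noise and renormalizing, which relies on matched image and text features lying close together in the shared CLIP space. The function names, the `noise_level` parameter, and the 512-dimensional stand-in feature are hypothetical and chosen for illustration only; a real pipeline would obtain the image feature from a CLIP image encoder.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def pseudo_text_feature(image_feature, noise_level=0.1, rng=None):
    """Build a stand-in text feature from an image feature.

    ASSUMPTION: Gaussian perturbation followed by renormalization is an
    illustrative choice, not a procedure stated in the abstract.
    `noise_level` (hypothetical) sets the length of the noise vector
    relative to the unit-length image feature.
    """
    if rng is None:
        rng = random.Random()
    unit_image = l2_normalize(image_feature)
    # Random direction of unit length in the same space.
    noise = l2_normalize([rng.gauss(0.0, 1.0) for _ in unit_image])
    perturbed = [x + noise_level * n for x, n in zip(unit_image, noise)]
    return l2_normalize(perturbed)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for a CLIP image embedding (hypothetical values).
    image_feature = [rng.gauss(0.0, 1.0) for _ in range(512)]
    text_like = pseudo_text_feature(image_feature, noise_level=0.1, rng=rng)
    print(cosine_similarity(image_feature, text_like))
```

In this sketch the pseudo text feature would take the place of a real caption embedding as the conditioning input during training, so no captions are needed; at inference time a real text embedding would be supplied instead. A larger `noise_level` moves the pseudo feature further from the image feature.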
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | COCO (Common Objects in Context) | FID | 8.12 | Lafite |
| Image Generation | COCO (Common Objects in Context) | Inception score | 32.34 | Lafite |
| Image Generation | COCO (Common Objects in Context) | SOA-C | 61.09 | Lafite |
| Image Generation | COCO (Common Objects in Context) | FID | 26.94 | Lafite (zero-shot) |
| Image Generation | COCO (Common Objects in Context) | FID-1 | 22.97 | Lafite (zero-shot) |
| Image Generation | COCO (Common Objects in Context) | FID-2 | 18.7 | Lafite (zero-shot) |
| Image Generation | COCO (Common Objects in Context) | FID-4 | 15.72 | Lafite (zero-shot) |
| Image Generation | COCO (Common Objects in Context) | FID-8 | 14.79 | Lafite (zero-shot) |
| Image Generation | COCO (Common Objects in Context) | Inception score | 26.02 | Lafite (zero-shot) |
| Image Generation | CUB | FID | 10.48 | Lafite |
| Image Generation | CUB | Inception score | 5.97 | Lafite |
| Image Generation | Multi-Modal-CelebA-HQ | FID | 12.54 | Lafite |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 8.12 | Lafite |
| Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 32.34 | Lafite |
| Text-to-Image Generation | COCO (Common Objects in Context) | SOA-C | 61.09 | Lafite |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 26.94 | Lafite (zero-shot) |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-1 | 22.97 | Lafite (zero-shot) |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-2 | 18.7 | Lafite (zero-shot) |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-4 | 15.72 | Lafite (zero-shot) |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-8 | 14.79 | Lafite (zero-shot) |
| Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 26.02 | Lafite (zero-shot) |
| Text-to-Image Generation | CUB | FID | 10.48 | Lafite |
| Text-to-Image Generation | CUB | Inception score | 5.97 | Lafite |
| Text-to-Image Generation | Multi-Modal-CelebA-HQ | FID | 12.54 | Lafite |
| 10-shot image generation | COCO (Common Objects in Context) | FID | 8.12 | Lafite |
| 10-shot image generation | COCO (Common Objects in Context) | Inception score | 32.34 | Lafite |
| 10-shot image generation | COCO (Common Objects in Context) | SOA-C | 61.09 | Lafite |
| 10-shot image generation | COCO (Common Objects in Context) | FID | 26.94 | Lafite (zero-shot) |
| 10-shot image generation | COCO (Common Objects in Context) | FID-1 | 22.97 | Lafite (zero-shot) |
| 10-shot image generation | COCO (Common Objects in Context) | FID-2 | 18.7 | Lafite (zero-shot) |
| 10-shot image generation | COCO (Common Objects in Context) | FID-4 | 15.72 | Lafite (zero-shot) |
| 10-shot image generation | COCO (Common Objects in Context) | FID-8 | 14.79 | Lafite (zero-shot) |
| 10-shot image generation | COCO (Common Objects in Context) | Inception score | 26.02 | Lafite (zero-shot) |
| 10-shot image generation | Multi-Modal-CelebA-HQ | FID | 12.54 | Lafite |
| 10-shot image generation | CUB | FID | 10.48 | Lafite |
| 10-shot image generation | CUB | Inception score | 5.97 | Lafite |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 8.12 | Lafite |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | Inception score | 32.34 | Lafite |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | SOA-C | 61.09 | Lafite |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 26.94 | Lafite (zero-shot) |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID-1 | 22.97 | Lafite (zero-shot) |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID-2 | 18.7 | Lafite (zero-shot) |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID-4 | 15.72 | Lafite (zero-shot) |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID-8 | 14.79 | Lafite (zero-shot) |
| 1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | Inception score | 26.02 | Lafite (zero-shot) |
| 1 Image, 2*2 Stitchi | Multi-Modal-CelebA-HQ | FID | 12.54 | Lafite |
| 1 Image, 2*2 Stitchi | CUB | FID | 10.48 | Lafite |
| 1 Image, 2*2 Stitchi | CUB | Inception score | 5.97 | Lafite |