Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

LAFITE: Towards Language-Free Training for Text-to-Image Generation

Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, Tong Sun

Date: 2021-11-27
Tasks: Text-to-Image Generation, Text to Image Generation, Image Generation

Abstract

One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time-consuming and costly. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text-conditioning is seamlessly alleviated by generating text features from image features. Extensive experiments are conducted to illustrate the effectiveness of the proposed method. We obtain state-of-the-art results in the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied to fine-tuning pre-trained models, which saves both training time and cost in training text-to-image generation models. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, yet with only around 1% of the model size and training data size relative to the recently proposed large DALL-E model.
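
The abstract's central idea, generating text features from image features in CLIP's joint space, can be illustrated with a minimal sketch. It assumes one simple scheme: perturb an image feature with Gaussian noise scaled relative to the feature's norm, then renormalize. The abstract does not spell out the procedure, so the function name `pseudo_text_feature`, the noise level `xi`, and the random stand-in for a CLIP image embedding are all illustrative assumptions rather than the authors' code. No real CLIP model is loaded.

```python
import numpy as np


def pseudo_text_feature(image_feature, xi=0.1, rng=None):
    """Return a unit-norm pseudo text feature near `image_feature`.

    Assumed scheme (illustrative only): add Gaussian noise whose length
    is xi * ||image_feature||, then renormalize to the unit sphere.
    """
    rng = np.random.default_rng() if rng is None else rng
    h = np.asarray(image_feature, dtype=np.float64)
    eps = rng.standard_normal(h.shape)
    # Scale the noise so its norm is exactly xi times the feature norm.
    perturbed = h + xi * eps * np.linalg.norm(h) / np.linalg.norm(eps)
    return perturbed / np.linalg.norm(perturbed)


# Hypothetical stand-in for a CLIP image embedding (512-d, unit norm).
rng = np.random.default_rng(0)
image_feature = rng.standard_normal(512)
image_feature /= np.linalg.norm(image_feature)

text_like = pseudo_text_feature(image_feature, xi=0.1, rng=rng)
cosine = float(image_feature @ text_like)
print(f"cosine similarity to the image feature: {cosine:.3f}")
```

With a small `xi`, the pseudo text feature stays close to the image feature in cosine similarity, which is the property a generator conditioned on such features would rely on in place of real captions.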

Results

Tasks (the same results are listed under each): Image Generation; Text-to-Image Generation; 10-shot image generation; 1 Image, 2*2 Stitchi

Dataset                            Metric           Value   Model
COCO (Common Objects in Context)   FID              8.12    Lafite
COCO (Common Objects in Context)   Inception score  32.34   Lafite
COCO (Common Objects in Context)   SOA-C            61.09   Lafite
COCO (Common Objects in Context)   FID              26.94   Lafite (zero-shot)
COCO (Common Objects in Context)   FID-1            22.97   Lafite (zero-shot)
COCO (Common Objects in Context)   FID-2            18.7    Lafite (zero-shot)
COCO (Common Objects in Context)   FID-4            15.72   Lafite (zero-shot)
COCO (Common Objects in Context)   FID-8            14.79   Lafite (zero-shot)
COCO (Common Objects in Context)   Inception score  26.02   Lafite (zero-shot)
CUB                                FID              10.48   Lafite
CUB                                Inception score  5.97    Lafite
Multi-Modal-CelebA-HQ              FID              12.54   Lafite
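
FID values such as those listed above are usually understood as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The sketch below computes that distance from given means and covariances only. The function name `frechet_distance` and the toy statistics are illustrative assumptions; it does not run an Inception network or reproduce the evaluation protocol behind the reported numbers.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    mu1 = np.asarray(mu1, dtype=np.float64)
    mu2 = np.asarray(mu2, dtype=np.float64)
    sigma1 = np.asarray(sigma1, dtype=np.float64)
    sigma2 = np.asarray(sigma2, dtype=np.float64)
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2, which are real and non-negative for
    # positive semi-definite inputs (up to rounding error).
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)


# Toy statistics (made up for illustration, not real Inception features).
mu = np.zeros(3)
sigma = np.diag([1.0, 2.0, 3.0])
same = frechet_distance(mu, sigma, mu, sigma)
shifted = frechet_distance(mu, sigma, mu + np.array([1.0, 2.0, 2.0]), sigma)
print(same, shifted)
```

Identical statistics give a distance of zero, and shifting only the mean adds the squared length of the shift, which is why lower values indicate generated images whose feature statistics are closer to those of real images.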

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)
CharaConsist: Fine-Grained Consistent Character Generation (2025-07-15)
CATVis: Context-Aware Thought Visualization (2025-07-15)