Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen

2022-04-13

Text-to-Image Generation, Image Generation, Conditional Image Generation

Abstract

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
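The two-stage structure described above (a prior that maps a caption to an image embedding, then a decoder that maps the image embedding to an image) can be illustrated with a toy sketch. This is not the paper's implementation: every function here (toy_text_embedding, toy_prior, toy_decoder, generate, variations) is a hypothetical stand-in that uses seeded Gaussian noise in place of a trained CLIP encoder and trained diffusion models, and EMBED_DIM is an arbitrary toy size. It only shows how data flows between the stages and why re-sampling the decoder on a fixed embedding yields variations.

```python
import hashlib
import random

EMBED_DIM = 8  # assumed toy size; real CLIP embeddings are much larger


def _unit(vec):
    """Scale a vector to unit length."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec]


def toy_text_embedding(caption):
    """Hypothetical stand-in for a text encoder: one fixed unit vector per caption."""
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return _unit([rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)])


def toy_prior(text_emb, rng, noise=0.1):
    """Stage 1 (toy): sample an image embedding given the text embedding."""
    return _unit([t + rng.gauss(0.0, noise) for t in text_emb])


def toy_decoder(image_emb, rng, size=4, noise=0.05):
    """Stage 2 (toy): sample a size x size 'image' conditioned on the image embedding."""
    return [
        [image_emb[(r * size + c) % EMBED_DIM] + rng.gauss(0.0, noise)
         for c in range(size)]
        for r in range(size)
    ]


def generate(caption, seed=0):
    """Run both stages: caption -> image embedding -> image."""
    rng = random.Random(seed)
    image_emb = toy_prior(toy_text_embedding(caption), rng)
    return image_emb, toy_decoder(image_emb, rng)


def variations(image_emb, n, seed=0):
    """Re-sample the decoder n times on one fixed image embedding."""
    rng = random.Random(seed)
    return [toy_decoder(image_emb, rng) for _ in range(n)]
```

Calling generate("a corgi playing a trumpet") returns the sampled embedding together with one toy image, and variations(embedding, 3) returns three different toy images that share the same embedding. In this sketch the shared embedding plays the role of the preserved semantics and style, while the decoder noise plays the role of the non-essential details that vary.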

Results

Task	Dataset	Metric	Value	Model
Image Generation	COCO (Common Objects in Context)	FID	10.39	DALL-E 2
Text-to-Image Generation	COCO (Common Objects in Context)	FID	10.39	DALL-E 2
10-shot image generation	COCO (Common Objects in Context)	FID	10.39	DALL-E 2
1 Image, 2*2 Stitchi	COCO (Common Objects in Context)	FID	10.39	DALL-E 2

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)
CharaConsist: Fine-Grained Consistent Character Generation (2025-07-15)
CATVis: Context-Aware Thought Visualization (2025-07-15)