CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang
The development of transformer-based text-to-image models is impeded by their slow generation and the complexity of handling high-resolution images. In this work, we put forward a solution based on hierarchical transformers and local parallel autoregressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, the Cross-Modal General Language Model (CogLM), and finetune it for fast super-resolution. The new text-to-image system, CogView2, shows highly competitive generation compared with the concurrent state-of-the-art DALL-E 2, and naturally supports interactive text-guided editing of images.
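The key to CogView2's speed is that image tokens are not emitted strictly one at a time: within each decoding round, many masked positions are predicted and committed in parallel. The sketch below illustrates that idea only; the stride-based commit schedule, the grid size, and the random stand-in predictor are simplifying assumptions, not the paper's actual model or schedule.

```python
# Toy sketch of local parallel autoregressive decoding (the idea behind
# CogView2's fast super-resolution stage). A random "predictor" stands in
# for the 6B-parameter transformer; the strided commit schedule is a
# simplified assumption, not the paper's exact local-window design.
import random

MASK = -1
SIDE = 8      # token grid is SIDE x SIDE
STRIDE = 4    # commit every STRIDE-th masked position per round
VOCAB = 16

def predict(grid):
    """Stand-in for the transformer: score every masked position at once."""
    return {pos: random.randrange(VOCAB)
            for pos, tok in enumerate(grid) if tok == MASK}

def decode(grid):
    """Fill the grid in rounds: each round commits a strided subset of the
    masked positions in parallel, instead of one token per step."""
    rounds = 0
    while MASK in grid:
        preds = predict(grid)
        for pos in sorted(preds)[::STRIDE]:  # parallel commits this round
            grid[pos] = preds[pos]
        rounds += 1
    return grid, rounds

grid, rounds = decode([MASK] * (SIDE * SIDE))
print(rounds)  # far fewer rounds than the 64 steps of token-by-token decoding
```

With a 64-token grid, this schedule finishes in roughly a dozen rounds rather than 64 sequential steps, which is the source of the claimed speedup.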
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 17.7 | CogView2 (6B, finetuned) |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 24.0 | CogView2 (6B, finetuned) |