Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Vector Quantized Diffusion Model for Text-to-Image Synthesis

Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo

Published: 2021-11-29 · CVPR 2022
Tasks: Denoising · Text-to-Image Generation · Image Generation
Links: Paper · PDF · Code (official)

Abstract

We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well suited to text-to-image generation because it not only eliminates the unidirectional bias of existing methods but also allows us to incorporate a mask-and-replace diffusion strategy that avoids the accumulation of errors, a serious problem for existing methods. Our experiments show that VQ-Diffusion produces significantly better text-to-image generation results than conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, VQ-Diffusion can handle more complex scenes and improves synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, text-to-image generation time increases linearly with the output image resolution and is therefore quite time-consuming even for normal-size images. VQ-Diffusion allows us to achieve a better trade-off between quality and speed: our experiments indicate that the VQ-Diffusion model with reparameterization is fifteen times faster than traditional AR methods while achieving better image quality.
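To make the mask-and-replace idea concrete, below is a minimal sketch of one forward-corruption step over a grid of VQ-VAE code indices. This is an illustration of the general technique described in the abstract, not the authors' implementation: the function name, the `MASK` id, and the `mask_prob`/`replace_prob` schedule values are all assumptions for the sketch.

```python
import torch

MASK = -1  # hypothetical id for the special [MASK] token, chosen outside the codebook range

def mask_and_replace_step(tokens: torch.Tensor, vocab_size: int,
                          mask_prob: float, replace_prob: float) -> torch.Tensor:
    """One forward-corruption step of a mask-and-replace discrete diffusion.

    Per token: with probability mask_prob it is absorbed into [MASK],
    with probability replace_prob it is swapped for a uniformly random
    codebook index, and otherwise it is kept. Tokens that are already
    [MASK] stay [MASK] (the mask state is absorbing).
    """
    u = torch.rand(tokens.shape)
    random_ids = torch.randint(0, vocab_size, tokens.shape)
    out = tokens.clone()
    out[u < mask_prob] = MASK                                   # absorb into [MASK]
    swap = (u >= mask_prob) & (u < mask_prob + replace_prob)
    out[swap] = random_ids[swap]                                # replace with random code
    out[tokens == MASK] = MASK                                  # keep mask state absorbing
    return out

# Toy usage: progressively corrupt a 32x32 grid of codebook indices.
codes = torch.randint(0, 1024, (32, 32))
for _ in range(4):
    codes = mask_and_replace_step(codes, vocab_size=1024,
                                  mask_prob=0.1, replace_prob=0.05)
```

A denoising network conditioned on the text prompt is then trained to recover the original code grid from the corrupted one. This framing also explains the speed claim: an AR model must decode a 32x32 code grid in 1024 strictly sequential steps, while the diffusion model refines all 1024 tokens in parallel for a fixed number of denoising steps that does not grow with resolution.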

Results

Task                     | Dataset                          | Metric | Value | Model
-------------------------|----------------------------------|--------|-------|---------------
Image Generation         | COCO (Common Objects in Context) | FID    | 13.86 | VQ-Diffusion-F
Image Generation         | COCO (Common Objects in Context) | FID    | 19.75 | VQ-Diffusion-B
Image Generation         | Oxford 102 Flowers               | FID    | 14.1  | VQ-Diffusion-F
Image Generation         | Oxford 102 Flowers               | FID    | 14.88 | VQ-Diffusion-B
Image Generation         | Oxford 102 Flowers               | FID    | 14.95 | VQ-Diffusion-S
Image Generation         | CUB                              | FID    | 10.32 | VQ-Diffusion-F
Image Generation         | CUB                              | FID    | 11.94 | VQ-Diffusion-B
Image Generation         | CUB                              | FID    | 12.97 | VQ-Diffusion-S
Text-to-Image Generation | COCO (Common Objects in Context) | FID    | 13.86 | VQ-Diffusion-F
Text-to-Image Generation | COCO (Common Objects in Context) | FID    | 19.75 | VQ-Diffusion-B
Text-to-Image Generation | Oxford 102 Flowers               | FID    | 14.1  | VQ-Diffusion-F
Text-to-Image Generation | Oxford 102 Flowers               | FID    | 14.88 | VQ-Diffusion-B
Text-to-Image Generation | Oxford 102 Flowers               | FID    | 14.95 | VQ-Diffusion-S
Text-to-Image Generation | CUB                              | FID    | 10.32 | VQ-Diffusion-F
Text-to-Image Generation | CUB                              | FID    | 11.94 | VQ-Diffusion-B
Text-to-Image Generation | CUB                              | FID    | 12.97 | VQ-Diffusion-S

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)