Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo
We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. The method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well suited to text-to-image generation because it not only eliminates the unidirectional bias of existing methods but also lets us incorporate a mask-and-replace diffusion strategy that avoids the accumulation of errors, a serious problem with existing methods. Our experiments show that VQ-Diffusion produces significantly better text-to-image generation results than conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, VQ-Diffusion can handle more complex scenes and improves the synthesized image quality by a large margin. Finally, we show that image generation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time-consuming even for normal-sized images. VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with reparameterization is fifteen times faster than traditional AR methods while achieving better image quality.
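The mask-and-replace strategy mentioned in the abstract can be illustrated with a minimal sketch of forward corruption on discrete codebook indices. The abstract does not specify the noise schedule, so the constant probabilities `p_mask` and `p_replace`, the `MASK` sentinel, and the function names below are assumptions made for illustration only, not the authors' implementation.

```python
import random

MASK = -1  # hypothetical sentinel id standing in for a [MASK] token


def mask_and_replace_step(tokens, codebook_size, p_replace, p_mask, rng):
    """Apply one corruption step to a list of discrete codebook indices.

    Each unmasked token is masked with probability p_mask, resampled
    uniformly from the codebook with probability p_replace, and kept
    otherwise. Tokens that are already masked stay masked.
    """
    if p_replace < 0.0 or p_mask < 0.0 or p_replace + p_mask > 1.0:
        raise ValueError("p_replace and p_mask must be >= 0 and sum to <= 1")
    out = []
    for tok in tokens:
        if tok == MASK:
            out.append(MASK)  # the mask state is absorbing
            continue
        u = rng.random()  # uniform in [0, 1)
        if u < p_mask:
            out.append(MASK)
        elif u < p_mask + p_replace:
            out.append(rng.randrange(codebook_size))  # uniform replacement
        else:
            out.append(tok)
    return out


def corrupt(tokens, codebook_size, steps, p_replace, p_mask, seed=0):
    """Run several corruption steps with fixed (assumed) probabilities."""
    rng = random.Random(seed)
    for _ in range(steps):
        tokens = mask_and_replace_step(tokens, codebook_size, p_replace, p_mask, rng)
    return tokens


if __name__ == "__main__":
    latent = [3, 7, 1, 0, 5, 2, 6, 4]  # toy sequence of VQ-VAE code indices
    print(corrupt(latent, codebook_size=8, steps=10, p_replace=0.05, p_mask=0.1))
```

Because masked positions are explicit, a denoising network can tell which tokens need to be filled in, while the random replacements expose it to tokens that are wrong rather than missing. A real model would vary these probabilities with the timestep instead of holding them constant as this sketch does.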
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 13.86 | VQ-Diffusion-F |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 19.75 | VQ-Diffusion-B |
| Text-to-Image Generation | Oxford 102 Flowers | FID | 14.1 | VQ-Diffusion-F |
| Text-to-Image Generation | Oxford 102 Flowers | FID | 14.88 | VQ-Diffusion-B |
| Text-to-Image Generation | Oxford 102 Flowers | FID | 14.95 | VQ-Diffusion-S |
| Text-to-Image Generation | CUB | FID | 10.32 | VQ-Diffusion-F |
| Text-to-Image Generation | CUB | FID | 11.94 | VQ-Diffusion-B |
| Text-to-Image Generation | CUB | FID | 12.97 | VQ-Diffusion-S |