Autoregressive Image Generation using Residual Quantization

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, Wook-Shin Han

2022-03-03CVPR 2022 1Text-to-Image Generation Quantization Image Reconstruction Image Generation Conditional Image Generation

Paper PDF Code Code(official)Code Code

Abstract

For autoregressive (AR) modeling of high-resolution images, vector quantization (VQ) represents an image as a sequence of discrete codes. A short sequence length is important for an AR model to reduce its computational costs to consider long-range interactions of codes. However, we postulate that previous VQ cannot shorten the code sequence and generate high-fidelity images together in terms of the rate-distortion trade-off. In this study, we propose the two-stage framework, which consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer, to effectively generate high-resolution images. Given a fixed codebook size, RQ-VAE can precisely approximate a feature map of an image and represent the image as a stacked map of discrete codes. Then, RQ-Transformer learns to predict the quantized feature vector at the next position by predicting the next stack of codes. Thanks to the precise approximation of RQ-VAE, we can represent a 256$\times$256 image as 8$\times$8 resolution of the feature map, and RQ-Transformer can efficiently reduce the computational costs. Consequently, our framework outperforms the existing AR models on various benchmarks of unconditional and conditional image generation. Our approach also has a significantly faster sampling speed than previous AR models to generate high-quality images.

Results

Task	Dataset	Metric	Value	Model
Image Generation	ImageNet 256x256	FID	3.83	RQ-Transformer
Image Generation	Conceptual Captions	FID	12.33	RQ-Transformer
Image Reconstruction	ImageNet	FID	1.83	RQ-VAE (8x8x16)
Text-to-Image Generation	Conceptual Captions	FID	12.33	RQ-Transformer
10-shot image generation	Conceptual Captions	FID	12.33	RQ-Transformer
1 Image, 2*2 Stitchi	Conceptual Captions	FID	12.33	RQ-Transformer

Related Papers

Efficient Deployment of Spiking Neural Networks on SpiNNaker2 for DVS Gesture Recognition Using Neuromorphic Intermediate Representation2025-09-04 An End-to-End DNN Inference Framework for the SpiNNaker2 Neuromorphic MPSoC2025-07-18 Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine2025-07-17 Angle Estimation of a Single Source with Massive Uniform Circular Arrays2025-07-17 fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting2025-07-17 Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection2025-07-17 FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization2025-07-17 A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints2025-07-17