Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang
Text-to-image generation in the general domain has long been an open problem that requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to make progress on this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and the recent, similar DALL-E.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | COCO (Common Objects in Context) | FID | 27.1 | CogView |
| Image Generation | COCO (Common Objects in Context) | FID-1 | 19.4 | CogView |
| Image Generation | COCO (Common Objects in Context) | FID-2 | 13.9 | CogView |
| Image Generation | COCO (Common Objects in Context) | FID-4 | 19.4 | CogView |
| Image Generation | COCO (Common Objects in Context) | FID-8 | 23.6 | CogView |
| Image Generation | COCO (Common Objects in Context) | Inception score | 18.2 | CogView |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 27.1 | CogView |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-1 | 19.4 | CogView |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-2 | 13.9 | CogView |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-4 | 19.4 | CogView |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-8 | 23.6 | CogView |
| Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 18.2 | CogView |