Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, ZiRui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu

2022-06-22 | Machine Translation | Text-to-Image Generation | Text to Image Generation | World Knowledge | Image Generation

Abstract

We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
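
The first step in the abstract, encoding an image as a sequence of discrete tokens, can be illustrated with a generic vector-quantization lookup. This is a minimal sketch and not the ViT-VQGAN implementation: the toy 2-D codebook, the patch embeddings, the `quantize` helper, and the choice of `bos_id` are all hypothetical stand-ins chosen for illustration. It only shows the idea that each patch embedding is replaced by the index of its nearest codebook vector, and that the resulting indices form the target sequence a sequence-to-sequence decoder would predict.

```python
import numpy as np

def quantize(patch_embeddings, codebook):
    """Return, for each patch embedding, the index of the nearest codebook vector."""
    # Pairwise squared Euclidean distances, shape (num_patches, codebook_size).
    diffs = patch_embeddings[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    return np.argmin(dists, axis=1)

# Hypothetical toy codebook with 4 entries of dimension 2.
codebook = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# Hypothetical embeddings for a 2x2 grid of patches, listed in raster order.
patches = np.array([[0.9, 0.1], [0.1, 0.2], [0.8, 0.9], [0.2, 0.7]])

image_tokens = quantize(patches, codebook)

# Assumption: a start-of-sequence id placed just past the codebook range.
bos_id = len(codebook)
decoder_targets = np.concatenate(([bos_id], image_tokens))
print(image_tokens.tolist(), decoder_targets.tolist())
```

In a real tokenizer the embeddings would come from a learned image encoder and the codebook would be learned jointly with it; the nearest-neighbour lookup is the only part this sketch is meant to convey.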

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | LAION COCO | FID | 8.39 | Parti Finetuned |
| Image Generation | LAION COCO | FID | 15.97 | Parti |
| Image Generation | COCO | FID | 3.22 | Parti Finetuned |
| Image Generation | COCO | FID | 7.23 | Parti |
| Text-to-Image Generation | LAION COCO | FID | 8.39 | Parti Finetuned |
| Text-to-Image Generation | LAION COCO | FID | 15.97 | Parti |
| Text-to-Image Generation | COCO | FID | 3.22 | Parti Finetuned |
| Text-to-Image Generation | COCO | FID | 7.23 | Parti |
| 10-shot image generation | COCO | FID | 3.22 | Parti Finetuned |
| 10-shot image generation | COCO | FID | 7.23 | Parti |
| 10-shot image generation | LAION COCO | FID | 8.39 | Parti Finetuned |
| 10-shot image generation | LAION COCO | FID | 15.97 | Parti |
| 1 Image, 2*2 Stitchi | COCO | FID | 3.22 | Parti Finetuned |
| 1 Image, 2*2 Stitchi | COCO | FID | 7.23 | Parti |
| 1 Image, 2*2 Stitchi | LAION COCO | FID | 8.39 | Parti Finetuned |
| 1 Image, 2*2 Stitchi | LAION COCO | FID | 15.97 | Parti |
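
Every value reported here is an FID (Fréchet Inception Distance), where lower is better. As a rough illustration of what that number measures, the sketch below computes the Fréchet distance between two Gaussians fitted to feature vectors. It is not the evaluation code behind these results: the function names are hypothetical, and the feature arrays are random stand-ins for the Inception-network features that an actual FID evaluation would extract from real and generated images.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^(1/2)) equals the sum of square roots of the
    # eigenvalues of sigma1 @ sigma2, which are non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

def fid_from_features(feats_a, feats_b):
    """Fit a Gaussian to each feature set (rows are samples) and compare them."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sigma_a = np.cov(feats_a, rowvar=False)
    sigma_b = np.cov(feats_b, rowvar=False)
    return frechet_distance(mu_a, sigma_a, mu_b, sigma_b)

# Hypothetical stand-in features: 500 samples of dimension 3.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 3))
fake_feats = real_feats + 1.0  # same covariance, mean shifted by 1 per dimension

score = fid_from_features(real_feats, fake_feats)
print(round(score, 6))
```

With identical covariances the covariance terms cancel, so the score in this toy case reduces to the squared distance between the two means.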

Related Papers

HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)