Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Generative Adversarial Transformers

Drew A. Hudson, C. Lawrence Zitnick

2021-03-01 · Tasks: Scene Generation, Disentanglement, Image Generation

Abstract

We introduce the GANformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image while maintaining linear computational efficiency, so that it can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network. We demonstrate the model's strength and robustness through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing that it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data efficiency. Further qualitative and quantitative experiments offer insight into the model's inner workings, revealing improved interpretability and stronger disentanglement, and illustrating the benefits and efficacy of our approach. An implementation of the model is available at https://github.com/dorarad/gansformer.
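
The abstract's core mechanism, image features and a small set of latent vectors exchanging information through attention, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the official implementation (that lives in the linked repository): the function names, the single-head dot-product attention, the per-channel normalization, the simplified two-way update and the scale-and-shift form chosen for "multiplicative integration" are all assumptions meant to mirror StyleGAN-style modulation. Because each of the n image positions attends only to k latents, the score matrix is n × k rather than n × n, which is where the linear cost comes from.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along `axis`.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bipartite_attention(queries, context, wq, wk, wv):
    """Single-head attention from `queries` (n, d) to `context` (k, d).

    The score matrix is (n, k): for a fixed number of latents k the cost
    grows linearly with the number of image positions n.
    """
    q, key, v = queries @ wq, context @ wk, context @ wv
    scores = q @ key.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def multiplicative_update(x, a, w_gamma, w_beta, eps=1e-5):
    """Hypothetical region-wise modulation (StyleGAN-like scale and shift).

    x: (n, d) image features, normalized per channel over all positions.
    a: (n, d) attended latent vector for each position; every position gets
       its own gamma/beta, so different regions can be styled differently.
    """
    x_norm = (x - x.mean(axis=0)) / (x.std(axis=0) + eps)
    return (a @ w_gamma) * x_norm + (a @ w_beta)

def duplex_step(x, y, p):
    """One simplified round of two-way propagation (an assumption, not the
    paper's exact update): latents gather from the image, then the image is
    modulated by the updated latents."""
    y = y + bipartite_attention(y, x, *p["to_latents"])  # (k, n) scores
    a = bipartite_attention(x, y, *p["to_image"])        # (n, k) scores
    return multiplicative_update(x, a, *p["modulate"]), y

# Usage with random weights (all sizes are arbitrary).
rng = np.random.default_rng(0)
n, k, d = 16 * 16, 8, 32
x = rng.standard_normal((n, d))
y = rng.standard_normal((k, d))

def mats(m):
    # m random (d, d) projection matrices, scaled to keep activations moderate.
    return tuple(rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(m))

params = {"to_latents": mats(3), "to_image": mats(3), "modulate": mats(2)}
x_new, y_new = duplex_step(x, y, params)
print(x_new.shape, y_new.shape)  # (256, 32) (8, 32)
```

With k = 1 every position receives the same attended vector, so the update collapses to one global scale and shift, much like StyleGAN's style modulation; with k > 1 different regions can be driven by different latents. For a 256 × 256 grid (n = 65,536) and a hypothetical k = 16, the score matrix holds about 1.05 million entries, against roughly 4.3 billion for full self-attention.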

Results

Task | Dataset | Metric | Value (lower is better) | Model
--- | --- | --- | --- | ---
Image Generation | CLEVR | FID-5k-training-steps | 9.1679 | GANformer
Image Generation | CLEVR | FID-5k-training-steps | 16.0534 | StyleGAN2
Image Generation | CLEVR | FID-5k-training-steps | 25.0244 | GAN
Image Generation | CLEVR | FID-5k-training-steps | 26.0433 | SAGAN
Image Generation | CLEVR | FID-5k-training-steps | 32.6031 | VQGAN
Image Generation | FFHQ | Clean-FID (70k) | 2.98 | StyleGAN2
Image Generation | FFHQ | FID-10k-training-steps | 10.8309 | StyleGAN2
Image Generation | FFHQ | FID-10k-training-steps | 12.8478 | GANsformer
Image Generation | FFHQ | FID-10k-training-steps | 13.1844 | GAN
Image Generation | FFHQ | FID-10k-training-steps | 16.2069 | SAGAN
Image Generation | FFHQ | FID-10k-training-steps | 63.1165 | VQGAN
Image Generation | Cityscapes | FID-10k-training-steps | 5.7589 | GANformer
Image Generation | Cityscapes | FID-10k-training-steps | 8.35 | StyleGAN2
Image Generation | Cityscapes | FID-10k-training-steps | 11.5652 | GAN
Image Generation | Cityscapes | FID-10k-training-steps | 12.8077 | SAGAN
Image Generation | Cityscapes | FID-10k-training-steps | 173.7971 | VQGAN
Image Generation | FFHQ 256 x 256 | FID | 7.42 | GANformer
Image Generation | LSUN Bedroom 256 x 256 | FID-10k-training-steps | 6.5085 | GANformer
Image Generation | LSUN Bedroom 256 x 256 | FID-10k-training-steps | 11.5255 | StyleGAN2
Image Generation | LSUN Bedroom 256 x 256 | FID-10k-training-steps | 12.1567 | GAN
Image Generation | LSUN Bedroom 256 x 256 | FID-10k-training-steps | 14.0595 | SAGAN
Image Generation | LSUN Bedroom 256 x 256 | FID-10k-training-steps | 59.6333 | VQGAN
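
Every metric in the table is a variant of the Fréchet Inception Distance (FID), for which lower is better: real and generated images are embedded with an Inception network, each set of embeddings is summarized by its mean μ and covariance Σ, and the score is ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). The sketch below covers only that final distance computation on precomputed feature matrices; the Inception feature extractor, the sample counts behind the metric names and Clean-FID's image preprocessing are not reproduced, and the function names are hypothetical.

```python
import numpy as np

def _sqrtm_psd(m):
    # Square root of a symmetric positive semi-definite matrix via eigendecomposition.
    m = (m + m.T) / 2.0                  # remove tiny asymmetry from round-off
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)      # clip small negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (num_samples, dim) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((S_a S_b)^{1/2}) equals Tr((S_a^{1/2} S_b S_a^{1/2})^{1/2}); the inner matrix
    # is symmetric PSD, so the eigendecomposition-based square root applies.
    root_a = _sqrtm_psd(cov_a)
    cross = _sqrtm_psd(root_a @ cov_b @ root_a)
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * np.trace(cross))

rng = np.random.default_rng(0)
real = rng.standard_normal((2000, 8))
fake = real + 0.5                               # same covariance, mean shifted by 0.5 per dim
print(round(frechet_distance(real, fake), 6))   # 2.0, i.e. 8 * 0.5**2
```

Common FID code applies `scipy.linalg.sqrtm` to the product Σ_r Σ_g directly; the symmetric reformulation used here avoids complex-valued round-off at the cost of two eigendecompositions.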

Related Papers

- CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models (2025-07-18)
- World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
- fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
- Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
- FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
- A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
- Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
- FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)